Abstract



We investigate the problem of finding reverse nearest neighbors efficiently. Although provably
good solutions exist for this problem in low or fixed dimensions, to date the methods
proposed in high dimensions are mostly heuristic. We introduce a method that is both provably
correct and efficient in all dimensions, based on a reduction of the problem to one instance of
$\varepsilon$-nearest neighbor search plus a controlled number of instances of exhaustive $r$-PLEB, a
variant of Point Location among Equal Balls in which all the $r$-balls centered at the data points
that contain the query point are sought, not just one. The former problem has been extensively
studied and elegantly solved in high dimensions using Locality-Sensitive Hashing (LSH) techniques.
By contrast, the latter problem has a complexity that is still not fully understood. We revisit the
analysis of the LSH scheme for exhaustive $r$-PLEB using a somewhat refined notion of
locality-sensitive family of hash functions, which brings out a meaningful output-sensitive term in
the complexity of the problem. Our analysis, combined with a non-isometric lifting of the data,
enables us to answer exhaustive $r$-PLEB queries (and down the road reverse nearest neighbors
queries) efficiently. Along the way, we obtain a simple algorithm for answering exact nearest
neighbor queries, whose complexity is parametrized by some condition number measuring the
inherent difficulty of a given instance of the problem.

1 Introduction 

Proximity queries are ubiquitous in science and engineering, and given their natural importance
they have received a lot of attention from the computer science community (see e.g. [29]). Nearest
Neighbor (NN) search is certainly among the most popular ones. Given a finite set $P$ with $n$
points sitting in some metric space $(X, \mathrm{d})$, the goal is to preprocess $P$ in such a way that, for any
query point $q \in X$, a nearest neighbor of $q$ among the set $P \setminus \{q\}$ can be found quickly. The NN
query can easily be answered in linear time by brute-force search, so the algorithmic challenge is
to preprocess the data points so as to find the answer in sub-linear time. Numerous methods have
been proposed, but their performance degrades significantly as the dimensionality $d$ of the
data increases, a phenomenon known as the curse of dimensionality. Typically, they suffer from
either space or query time that is exponential in $d$, and so they become no better than brute-force
search when $d$ exceeds a few dozen or a few hundred [34].
In light of the apparent hardness of NN search, an approximate version of the problem called
$\varepsilon$-NN has been considered, where the answer can be any point of $P \setminus \{q\}$ whose distance to $q$ is
within a given factor $(1+\varepsilon)$ of the true nearest neighbor distance (see e.g. [3, 18, 22, 26]). Inspired by the
random projection techniques developed by Kleinberg [22], Indyk and Motwani [18] and Kushilevitz
et al. [26] proposed data structures to answer $\varepsilon$-NN queries with truly sublinear runtime and fully
polynomial space complexity. The approach developed in [18] is based on the idea of
Locality-Sensitive Hashing (LSH), which consists in hashing the data and query points into a collection of
tables indexed by random hash functions, such that the query point $q$ is more likely to collide
with nearby data points than with data points lying far away. This technique solves a decision
version of the $\varepsilon$-NN problem called Point Location among Equal Balls ($(r,\varepsilon)$-PLEB), which asks
to decide whether the distance of $q$ to $P \setminus \{q\}$ is below a given threshold $r$ or above $r(1+\varepsilon)$. The
output is proven correct with high probability, and the query time is bounded by $O(d\, n^{\varrho}\, \mathrm{polylog}\, n)$
for some constant $\varrho = \frac{1}{1+\Theta(\varepsilon)}$. Moreover, Indyk and Motwani [18] proposed a reduction of $\varepsilon$-NN
search to a poly-logarithmic number of $(r,\varepsilon)$-PLEB queries, thus providing a fully sublinear-time
and polynomial-space procedure for solving $\varepsilon$-NN. Although originally designed for the Hamming
cube, LSH was later extended (see e.g. [11]) to affine spaces $\mathbb{R}^d$ equipped with $\ell_s$-norms, $s \in (0, 2]$.

In this paper we mainly focus on the reverse problem, known as Reverse Nearest Neighbors
(RNN) search. Given a finite set $P$ with $n$ points sitting in some metric space $(X, \mathrm{d})$, the goal is
to preprocess $P$ in such a way that, for any query point $q \in X$, one can find the influence set of $q$, i.e.
the set $\mathrm{RNN}_P(q)$ formed by the points $p \in P \setminus \{q\}$ that are closer to $q$ than to $P \setminus \{p\}$. Such points
are called reverse nearest neighbors of $q$. RNN queries arise in many different contexts, and it is
no surprise that they have received a lot of attention since their formal introduction by Korn and
Muthukrishnan [23]. A wealth of methods have been proposed (see e.g. [9, 23, 30]),
which behave well in practice on some classes of inputs. However, these methods are mostly
heuristic, and to date very little is known about the theoretical complexity of RNN search, except
in low (see e.g. [27]) or fixed [6] dimensions, where the dimensionality of the data can be considered as
a mere constant. The crux of the matter is that, in contrast to ($\varepsilon$-)NN search, the answer to an
RNN query is not a single point but a set of points, whose size can be up to exponential in the
ambient dimension [28], so there is no way to achieve a systematic sub-linear query time. Ideally,
one would like to achieve a query time of the form $O(n^{\varrho} + |\mathrm{RNN}_P(q)|)$, where $\varrho$ is a constant less
than 1 and $|\mathrm{RNN}_P(q)|$ is the size of the reverse nearest neighbors set. The big-O notation may
hide extra factors that are polynomial in $d$ and poly-logarithmic in $n$. Intuitively, the first term in
the bound would represent the incompressible time needed to locate the query point $q$ with respect
to the point cloud $P$, as in a standard NN query, while the second term would represent the size
of the sought-for answer.

Our contributions. Our main contribution (see Section 5) is a reduction of RNN search to
one instance of $\varepsilon$-NN search plus a poly-logarithmic number of instances of exhaustive $r$-PLEB,
a set-theoretic version of PLEB where not only one $r$-ball containing the query point $q$ is sought,
but all such balls. Our reduction is based on a partitioning of the data points into buckets
according to their nearest neighbor distances, combined with a pruning strategy that prevents the
inspection of too many buckets at query time.

Turning our reduction into an effective algorithm for RNN search requires adapting the LSH
scheme to solve exhaustive $r$-PLEB queries. Such an adaptation was proposed in [29, Chapter 1],
with expected query time $O(n^{\varrho} + n^{\varrho}\,|B_P(q, r(1+\varepsilon))|)$, where $\varrho = \frac{1}{1+\Theta(\varepsilon)}$ and where $\varepsilon > 0$ is a
user-defined parameter. Even though the output of the query is the set $B_P(q,r)$, the query time
depends on the size of the superset $B_P(q, r(1+\varepsilon))$, and when choosing $\varepsilon$ the user must find a
trade-off between increasing the size of $B_P(q, r(1+\varepsilon))$ and increasing the average retrieval cost $n^{\varrho}$
per point of $B_P(q, r(1+\varepsilon))$. In Section 3 we revisit the analysis of [29, Chapter 1] using a somewhat
finer concept of locality-sensitive hashing (see Definition 3.1), which enables us to quantify more
precisely the amount of collisions with the query point that may occur within the hash tables stored
in the LSH data structure. Taking advantage of this refined analysis, we propose a simple extra
preprocessing step that reduces the average retrieval cost per point of $B_P(q, r(1+\varepsilon))$ down to $n^{\alpha}$ for
some constant $\alpha < \varepsilon\varrho < \varepsilon$, thereby making the previous trade-off no longer necessary. The price to
pay is a slight degradation of the absolute term $n^{\varrho}$ in the complexity bound, which rises to $n^{\vartheta}$ where
$\vartheta = \frac{1}{1+\Theta(\varepsilon^2)}$ (Theorem 3.7). All in all, the query time bound becomes $O(n^{\vartheta} + n^{\alpha}\,|B_P(q, r(1+\varepsilon))|)$
and therefore remains sublinear in $n$ as long as $|B_P(q, r(1+\varepsilon))| < n^{1-\varepsilon}$. Intuitively, our extra
preprocessing step consists in lifting the point cloud $P$ and query point $q$ one dimension higher
through some highly non-isometric embedding, so that the induced metric distortion moves $q$ away
from $P$ and further concentrates the distribution of the distances to $q$ around the parameter value
$r$, thereby reducing the total number of collisions with $q$ within the hash tables. The output of the
query can still be proven correct thanks to the fact that the embedding preserves the order of the
distances to $q$. This approach stands in contrast to the general trend of applying low-distortion
embeddings to solve proximity queries.

Down the road, these advances lead to an algorithm for solving RNN queries with high
probability in expected $O\big(n^{\frac{1}{1+\Theta(\varepsilon^2)}} + n^{\varepsilon}\,|O(\varepsilon)\text{-}\mathrm{RNN}_P(q)|\big)$ time using fully polynomial space, where
$\varepsilon > 0$ is a user-defined parameter and $O(\varepsilon)\text{-}\mathrm{RNN}_P(q)$ is a superset of $\mathrm{RNN}_P(q)$ whose points are
$O(\varepsilon)$-close to being true reverse nearest neighbors of $q$ (Theorem 5.3). To the best of our knowledge,
this is the first algorithm for answering RNN queries that is provably correct and efficient in all
dimensions. Furthermore, the algorithm and its analysis extend naturally to the bichromatic
setting where the data points are split into two disjoint categories, e.g. clients and servers, a scenario
that is encountered in various applications [23].

Along the way, in Section 4 we obtain a simple algorithm that can answer exact NN queries
in expected $O\big(n^{\frac{1}{1+\Theta(\varepsilon^2)}} + n^{\varepsilon}\,|O(\varepsilon)\text{-}\mathrm{NN}_P(q)|\big)$ time using fully polynomial space, where $\varepsilon > 0$
is a user-defined parameter and $O(\varepsilon)\text{-}\mathrm{NN}_P(q)$ is a set of approximate nearest neighbors of $q$
(Theorem 4.3). The first term in the running time bound corresponds to a standard $\varepsilon$-NN query,
while the second term is parametrized by the size of $O(\varepsilon)\text{-}\mathrm{NN}_P(q)$, which thereby plays the role
of a condition number measuring the discrepancy in difficulty between the exact and approximate
NN queries on a given instance. Note that our algorithm is not expected to perform as well
as state-of-the-art techniques in growth-restricted spaces. However, its complexity
bounds hold in a more general setting, and its sublinear behavior on a particular instance relies on
the weaker hypothesis that the condition number of this instance lies below the threshold $n^{1-\varepsilon}$. In
the same spirit, Datar et al. [11] designed a lightweight version of our algorithm that only works in
Euclidean spaces but is competitive with these techniques (see e.g. [16]).

Throughout the paper, the analysis is carried out either in full generality in metric spaces that
admit locality-sensitive families of hash functions, or more precisely in $(\mathbb{R}^d, \ell_s)$ when liftings of the
data one dimension higher come into play. The case of the $d$-dimensional Hamming cube is also
encompassed by our analysis since this space embeds itself isometrically into $(\mathbb{R}^d, \ell_1)$.
2 Preliminaries

In Section 2.1 we introduce some useful notation and state the nearest neighbor and reverse nearest
neighbors problems formally. In Sections 2.2 through 2.4 we give an overview of LSH and its
application to approximate nearest neighbor search, with a special emphasis on the case of affine
spaces $\mathbb{R}^d$ equipped with $\ell_s$-norms in Section 2.4. The data structures and algorithms introduced
in this section are used as black-boxes in the rest of the paper.



2.1 Problem statements and notations 

Throughout the paper, $(X, \mathrm{d})$ denotes a metric space and $P$ a finite subset of $X$. Given a point
$x \in X$, let $\mathrm{d}(x, P)$ denote the distance of $x$ to $P$, that is: $\mathrm{d}(x, P) = \min\{\mathrm{d}(x, p) \mid p \in P \setminus \{x\}\}$.

Given a parameter $r > 0$, let $B(x, r)$ denote the metric ball of center $x$ and radius $r$, and let $B_P(x, r)$
be the set of points of $P \setminus \{x\}$ that lie within this ball. Then, $B_P(x, \mathrm{d}(x, P))$ is the set of nearest
neighbors of $x$ among $P \setminus \{x\}$, denoted $\mathrm{NN}_P(x)$. By analogy, given a parameter $\varepsilon > 0$, $\varepsilon\text{-}\mathrm{NN}_P(x)$
denotes the set $B_P(x, (1+\varepsilon)\,\mathrm{d}(x, P))$ of $\varepsilon$-nearest neighbors of $x$ among $P \setminus \{x\}$. The usual convention
is that point $x$ itself is excluded from these sets, which is not mentioned explicitly in our notations
for simplicity but will be admitted implicitly throughout the paper.

Problem 1 (NN). Given a query point $q \in X$, the nearest neighbor query asks to return any
point of $\mathrm{NN}_P(q)$.

Problem 2 ($\varepsilon$-NN). Given a query point $q \in X$, the $\varepsilon$-nearest neighbor query asks to return any
point of $\varepsilon\text{-}\mathrm{NN}_P(q)$.

Given now a point $x \in X$, let $\mathrm{RNN}_P(x)$ denote the set of reverse nearest neighbors of $x$ among
$P \setminus \{x\}$, which by definition are the points $p \in P \setminus \{x\}$ such that $x \in \mathrm{NN}_{P \cup \{x\}}(p)$. By analogy, let
$\varepsilon\text{-}\mathrm{RNN}_P(x)$ denote the set of reverse $\varepsilon$-nearest neighbors of $x$ among $P \setminus \{x\}$, which by definition
are the points $p \in P \setminus \{x\}$ such that $x \in \varepsilon\text{-}\mathrm{NN}_{P \cup \{x\}}(p)$. Here again, point $x$ itself is excluded from
the various sets, a fact omitted in our notations for simplicity but admitted implicitly.

Problem 3 (RNN). Given a query point $q \in X$, the reverse nearest neighbors query asks to
retrieve the set $\mathrm{RNN}_P(q)$.
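As a point of reference for Problems 1 and 3, the definitions can be turned directly into a brute-force baseline. The following is only a minimal sketch under assumptions that are not in the text: points are distinct tuples in the Euclidean plane, and the helper names `nn_distance` and `reverse_nearest_neighbors` are illustrative. It runs in quadratic time and is meant to make the definitions concrete, not to be efficient.

```python
import math

def nn_distance(x, points):
    """d(x, P): distance from x to the closest point of P other than x itself."""
    return min(math.dist(x, p) for p in points if p != x)

def reverse_nearest_neighbors(q, points):
    """Points p of P \\ {q} for which q is a nearest neighbor of p among (P ∪ {q}) \\ {p}.

    Assumption of this sketch: the points of P are pairwise distinct.
    """
    result = []
    for p in points:
        if p == q:
            continue
        # Distance from p to its nearest neighbor among the other data points.
        others = [x for x in points if x != p and x != q]
        d_p = min((math.dist(p, x) for x in others), default=math.inf)
        # q is a nearest neighbor of p iff no other data point is strictly closer to p.
        if math.dist(p, q) <= d_p:
            result.append(p)
    return result

P = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.0)]
q = (1.8, 0.0)
print(nn_distance(q, P))                # distance from q to (1.0, 0.0)
print(reverse_nearest_neighbors(q, P))  # only (1.0, 0.0) is closer to q than to the rest of P
```

Note that the nearest neighbor of $q$ and the reverse nearest neighbors of $q$ are computed from different distances: the former compares distances from $q$, the latter compares, for each $p$, the distance to $q$ with the distance from $p$ to the rest of $P$.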



2.2 Reducing approximate nearest neighbor search to its decision version 

Given a parameter $r$, the decision version of Problem 1 consists in deciding whether $\mathrm{d}(q, P)$ is
smaller or larger than $r$. This problem is also known as Point Location among Equal $r$-Balls
($r$-PLEB) in the literature, because it is equivalent to deciding whether $q$ lies inside the union of balls
of same radius $r$ about the points of $P$. It is formalized as follows:

Problem 4 ($r$-PLEB). Given a query point $q \in X$, the $r$-PLEB query asks the following:

• if $\mathrm{d}(q, P) \le r$, then return YES and any point $p \in P$ such that $\mathrm{d}(p, q) \le r$;

• else ($\mathrm{d}(q, P) > r$), return NO.

By analogy, the decision version of Problem 2 consists in deciding whether $\mathrm{d}(q, P)$ is smaller
than $r$ or larger than $r(1+\varepsilon)$. If it lies between these two bounds, then any answer is acceptable.
The formal statement is the following:

Problem 5 ($(r,\varepsilon)$-PLEB). Given a query point $q \in X$, the $(r,\varepsilon)$-PLEB query asks the following:

• if $\mathrm{d}(q, P) \le r$, then return YES and any point $p \in P$ such that $\mathrm{d}(p, q) \le r(1+\varepsilon)$;

• if $\mathrm{d}(q, P) > r(1+\varepsilon)$, then return NO;

• else ($r < \mathrm{d}(q, P) \le r(1+\varepsilon)$), return any of the above answers.

The original LSH paper [18] showed a construction that reduces the $\varepsilon$-NN problem to a
logarithmic number of $(r,\varepsilon)$-PLEB queries. Other reductions have since been proposed, and in this
paper we will make use of the following one, introduced by Har-Peled [H], which is simple and
works in any metric space. It is based on a divide-and-conquer strategy, building a tree $\mathcal{T}(P,\varepsilon)$
of height $O(\ln n)$, such that each node $v$ is assigned a subset $P_v \subseteq P$ and an interval $[r_v, R_v]$ of
possible values for parameter $r$. Each $\varepsilon$-NN query is performed by traversing down the search
tree $\mathcal{T}(P,\varepsilon)$, and by answering two $(r,\varepsilon)$-PLEB queries at each node $v$ to decide (approximately)
whether $\mathrm{d}(q, P)$ belongs to the interval $[r_v, R_v]$ or not: in the former case, a simple dichotomy on
a geometric progression of values of $r$ within the interval makes it possible to determine within a
relative error of $1+\varepsilon$ where $\mathrm{d}(q, P)$ lies in the interval, and to return a point of $\varepsilon\text{-}\mathrm{NN}_P(q)$, with a
total number of $(r,\varepsilon)$-PLEB queries bounded by $O(\log_2 \log_{1+\varepsilon} \frac{R_v}{r_v})$; in the latter case, the choice of
the child of $v$ in which to continue the search is determined from the output of the two $(r,\varepsilon)$-PLEB
queries. In this construction, the ratio $\frac{R_v}{r_v}$ is guaranteed to be at most a polynomial in $\frac{n}{\varepsilon}$ with
bounded degree, so we have $O(\log_{1+\varepsilon} \frac{R_v}{r_v}) = O(\log_{1+\varepsilon} \frac{n}{\varepsilon}) = O(\frac{1}{\varepsilon} \ln \frac{n}{\varepsilon})$. Thus,

Theorem 2.1 (see [II]). Given a finite set $P \subseteq X$ with $n$ points, the tree $\mathcal{T}(P,\varepsilon)$ stores $O(\frac{1}{\varepsilon} \ln \frac{n}{\varepsilon})$
data structures for $(r,\varepsilon)$-PLEB queries per node, and it reduces every $\varepsilon$-NN query to a set of
$O(\ln n + \ln \frac{1}{\varepsilon} + \ln\ln \frac{n}{\varepsilon}) = O(\ln \frac{n}{\varepsilon})$ queries of type $(r,\varepsilon)$-PLEB.
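The dichotomy on a geometric progression of radii can be sketched as follows. This is an illustration under simplifying assumptions rather than the construction of the cited reduction: the oracle `exact_pleb` is a brute-force stand-in for an exact $r$-PLEB query (with an approximate $(r,\varepsilon)$-PLEB oracle the answers are only guaranteed outside the gap between $r$ and $r(1+\varepsilon)$), the interval bounds `r_lo` and `r_hi` are assumed to bracket the nearest neighbor distance, and all names are hypothetical.

```python
import math

def exact_pleb(q, points, r):
    """Stand-in r-PLEB oracle (brute force): YES iff some point of P lies within distance r of q."""
    return any(math.dist(q, p) <= r for p in points if p != q)

def search_radius(q, points, r_lo, r_hi, eps):
    """Binary search over the radii r_lo * (1 + eps)**i, assuming r_lo < d(q, P) <= r_hi.

    Returns the smallest radius of the progression for which the oracle answers YES,
    together with the number of oracle queries that were issued.
    """
    m = math.ceil(math.log(r_hi / r_lo) / math.log(1.0 + eps))
    lo, hi = 0, m  # invariant: the oracle answers YES at index hi
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if exact_pleb(q, points, r_lo * (1.0 + eps) ** mid):
            hi = mid
        else:
            lo = mid + 1
    return r_lo * (1.0 + eps) ** lo, queries

radius, n_queries = search_radius((0.0, 1.0), [(0.0, 0.0), (3.0, 4.0)], 0.5, 8.0, 0.5)
print(radius, n_queries)
```

The progression has about $\log_{1+\varepsilon}(r_{hi}/r_{lo})$ terms and the search issues a number of queries logarithmic in that count, which is the source of the $\log_2 \log_{1+\varepsilon}$ term in the bound above.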

2.3 Solving $(r,\varepsilon)$-PLEB queries by means of Locality-Sensitive Hashing

Definition 2.2. Given a metric space $(X, \mathrm{d})$ and two radii $r_1 < r_2$, a family $\mathcal{F} = \{f : X \to \mathbb{Z}\}$ of
hash functions is called $(r_1, r_2, p_1, p_2)$-sensitive if there exist quantities $1 > p_1 > p_2 > 0$ such that
$\forall x, y \in X$,

• $\mathrm{d}(x,y) \le r_1 \Rightarrow \Pr\{f(x) = f(y)\} \ge p_1$,

• $\mathrm{d}(x,y) > r_2 \Rightarrow \Pr\{f(x) = f(y)\} \le p_2$,

where probabilities are given for a random choice of hash function $f \in \mathcal{F}$ according to some
probability distribution over the family.

Intuitively, a $(r_1, r_2, p_1, p_2)$-sensitive family of hash functions distinguishes points that are close
together from points that are far apart.

Assuming that a $(r, r(1+\varepsilon), p_1, p_2)$-sensitive family $\mathcal{F}$ of hash functions is given, it is possible
to answer $(r,\varepsilon)$-PLEB queries in sub-linear time [13, 18]. The algorithm proceeds as follows:

• In the pre-processing phase, it boosts the sensitivity of the family $\mathcal{F}$ by building $k$-dimensional
vectors $g = (f_1, \cdots, f_k) : X \to \mathbb{Z}^k$ whose coordinate functions are drawn independently
at random from $\mathcal{F}$. The hash key of a point $x \in X$ is now a $k$-dimensional vector $g(x) =
(f_1(x), \cdots, f_k(x))$, and two keys $g(x)$ and $g(y)$ are equal if and only if $f_i(x) = f_i(y)$ for all
$i = 1, \cdots, k$. Call $\mathcal{G}$ the family of such random hash vectors. The algorithm draws $L$ elements
$g_1, \cdots, g_L$ independently from $\mathcal{G}$, and it builds the $L$ corresponding hash tables $H_1, \cdots, H_L$.
It then hashes each data point $p \in P$ into every hash table $H_i$ using vector $g_i(p)$ as the hash
key.

• In the online query phase, the algorithm hashes the query point $q$ into each of the $L$ hash
tables, and it collects all the points colliding with $q$ therein, until either some point $p \in
B_P(q, r(1+\varepsilon))$ has been found or more than $3L$ points (including duplicates) have been
collected in total. In the former case the algorithm answers YES and returns $p$, while in the
latter case it answers NO. It also answers NO if no point of $B_P(q, r(1+\varepsilon))$ has been found
after visiting all the hash tables.

Letting $k = \left\lceil \frac{\ln n}{\ln 1/p_2} \right\rceil$ and $L = \left\lceil \frac{n^{\varrho}}{p_1} \right\rceil$, where $\varrho = \frac{\ln 1/p_1}{\ln 1/p_2}$, one can prove that this procedure gives the
correct answer with constant probability [13, 18]. By repeating it $\omega \ln n$ times, for a fixed constant
$\omega > 0$, one can increase the probability of success to at least $1 - \frac{1}{n^{\omega}}$. Thus,

Theorem 2.3 (see [13, 18]). Given a finite set $P$ with $n$ points in $(X, \mathrm{d})$, two parameters $r, \varepsilon > 0$,
and a $(r, r(1+\varepsilon), p_1, p_2)$-sensitive family $\mathcal{F}$ of hash functions for some constants $p_1 > p_2$, the LSH
data structure has size $O\big(\frac{n^{1+\varrho}}{p_1} \ln n\big)$ and answers $(r,\varepsilon)$-PLEB queries correctly with high probability
in $O\big(\frac{n^{\varrho} \ln n}{p_1 \ln 1/p_2} \ln n\big)$ time, where $\varrho = \frac{\ln 1/p_1}{\ln 1/p_2} < 1$.

Note that the running time bound ignores the time needed to compute distances and to evaluate
hash functions. These typically depend on the metric space $(X, \mathrm{d})$ and hash family $\mathcal{F}$ considered.
The probabilities $p_1, p_2$ also depend on $\mathcal{F}$, therefore they may vary with $r$ and $\varepsilon$.
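To make the roles of the parameters concrete, the following sketch computes $k$, $L$ and $\varrho$ from $n$, $p_1$ and $p_2$, assuming the choices $k = \lceil \ln n / \ln(1/p_2) \rceil$ and $L = \lceil n^{\varrho}/p_1 \rceil$ with $\varrho = \ln(1/p_1)/\ln(1/p_2)$. The function name `lsh_parameters` and the numeric inputs are illustrative only.

```python
import math

def lsh_parameters(n, p1, p2):
    """Return (k, L, rho) for an (r, r(1+eps), p1, p2)-sensitive family and n data points."""
    if not (0.0 < p2 < p1 < 1.0) or n < 2:
        raise ValueError("need 0 < p2 < p1 < 1 and n >= 2")
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    k = math.ceil(math.log(n) / math.log(1.0 / p2))  # hash functions per key
    L = math.ceil(n ** rho / p1)                      # number of hash tables
    return k, L, rho

n, p1, p2 = 10_000, 0.8, 0.6
k, L, rho = lsh_parameters(n, p1, p2)
print(k, L, round(rho, 4))
# With this k, a far pair collides on a full key with probability at most p2**k <= 1/n,
# while a near pair collides with probability at least p1**k >= p1 * n**(-rho).
print(p2 ** k <= 1.0 / n, p1 ** k >= p1 * n ** (-rho))
```

The two inequalities printed at the end are the ones that drive the analysis: $k$ is chosen so that far points rarely share a key, and $L$ is chosen so that a near point shares a key with $q$ in at least one table with constant probability.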



2.4 The case of affine spaces 

In most of the paper the ambient space $X$ will be the affine space $\mathbb{R}^d$ equipped with some
$\ell_s$-norm, $s \in (0, 2]$, and $\mathrm{d}$ will denote the induced distance: $\forall x, y \in \mathbb{R}^d$, $\mathrm{d}(x,y) = \|x - y\|_s =
\big(\sum_{i=1}^{d} |x_i - y_i|^s\big)^{1/s}$, where $x_i, y_i$ stand for the $i$-th coordinates of $x, y$.

In $(\mathbb{R}^d, \ell_s)$ we use the families of hash functions introduced by Datar et al. [11],¹ which are
derived from so-called $s$-stable distributions. A distribution $D$ over the reals is called $s$-stable if any
linear combination $\sum_i \alpha_i X_i$ of finitely many independent variables $X_i$ with distribution $D$ has the
same distribution as $\big(\sum_i |\alpha_i|^s\big)^{1/s} X$, where $X$ is a random variable with distribution $D$. Given such
a distribution $D$, one can build $(r, r(1+\varepsilon), p_1, p_2)$-sensitive families of hash functions in $(\mathbb{R}^d, \ell_s)$ for
any radius $r > 0$ and any approximation parameter $\varepsilon > 0$ as follows. First, rescale the data and
query points so that $r = 1$. Then, choose a real value $w > 0$ and define a two-parameter family
of hash functions $\mathcal{F} = \{f_{a,b} : \mathbb{R}^d \to \mathbb{Z}\}_{a \in \mathbb{R}^d,\, b \in [0,w)}$ by $f_{a,b}(x) = \left\lfloor \frac{a \cdot x + b}{w} \right\rfloor$, where $\cdot$ stands for the
inner product in $\mathbb{R}^d$. The probability distribution over the family is not uniform: the coordinates
of vector $a$ are chosen independently according to $D$, while $b$ is drawn uniformly at random from
the interval $[0, w)$. The local sensitivity of this family depends on the choice of parameter $w$. More
precisely, according to Datar et al. [11], given two points at distance $l$ of each other, the probability
(over a random choice of hash function) that these points collide is

$$\Phi(l) = \int_0^w \frac{1}{l}\, \varphi_s\!\left(\frac{t}{l}\right) \left(1 - \frac{t}{w}\right) dt, \tag{1}$$

where $\varphi_s$ denotes the probability density function of the absolute value of $D$. The probabilities
$p_1, p_2$ in Theorem 2.3 are then obtained as $\Phi(1)$ and $\Phi(1+\varepsilon)$ respectively. They do not depend on
$r$, thanks to the rescaling. Note that they do not depend on the dimension $d$ either.
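The hash functions $f_{a,b}$ are simple enough to sketch directly. The code below is an illustration restricted to the two stable distributions used later in the paper (Cauchy for $s = 1$, standard normal for $s = 2$); the function names and the Monte Carlo helper `collision_rate` are assumptions made for this example and are not part of the original scheme.

```python
import math
import random

def sample_stable(s, rng):
    """Draw one s-stable sample for s = 1 (Cauchy) or s = 2 (standard normal)."""
    if s == 1:
        return math.tan(math.pi * (rng.random() - 0.5))  # standard Cauchy via inverse CDF
    if s == 2:
        return rng.gauss(0.0, 1.0)
    raise ValueError("only s = 1 and s = 2 are handled in this sketch")

def make_hash(d, w, s, rng):
    """Return f_{a,b}(x) = floor((a . x + b) / w), with a drawn from D^d and b uniform in [0, w)."""
    a = [sample_stable(s, rng) for _ in range(d)]
    b = w * rng.random()
    def f(x):
        return math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)
    return f

def collision_rate(x, y, w, s, trials=20000, seed=0):
    """Monte Carlo estimate of Pr[f(x) = f(y)] over the random choice of f."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        f = make_hash(len(x), w, s, rng)
        hits += f(x) == f(y)
    return hits / trials

# Points are assumed to be already rescaled so that r = 1.
print(collision_rate((0.0, 0.0), (1.0, 0.0), w=4.0, s=2))  # estimate of Phi(1)
print(collision_rate((0.0, 0.0), (2.0, 0.0), w=4.0, s=2))  # estimate of Phi(2)
```

The estimates depend only on the distance between the two points and on $w$, not on the dimension, which is the property stated above for $p_1$ and $p_2$.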



1 A possible improvement would be to use the hash functions defined by Andoni and Indyk [5] instead, which are 
known to give better complexity bounds. For now we leave this as future work. 
Focusing back on Har-Peled's construction, recall from Theorem 2.1 that each node $v$ of
the tree $\mathcal{T}(P,\varepsilon)$ stores $O(\frac{1}{\varepsilon} \ln \frac{n}{\varepsilon})$ data structures for answering $(r,\varepsilon)$-PLEB queries, each of size
$O(|P_v|^{1+\varrho} \ln |P_v|)$. Let us point out that by construction the subsets of $P$ assigned to the children of $v$
form a partition of $P_v$. Then, a recursion gives the following bounds on the size of $\mathcal{T}(P,\varepsilon)$ and on
the query time:²

Corollary 2.4 (see [15]). Given a finite set $P$ with $n$ points in $(\mathbb{R}^d, \ell_s)$, $s \in (0, 2]$, and a parameter
$\varepsilon > 0$, the tree structure $\mathcal{T}(P,\varepsilon)$ and its associated $(r,\varepsilon)$-PLEB data structures can answer
$\varepsilon$-NN queries correctly with high probability in $O\big(\frac{n^{\varrho}}{p_1 \ln 1/p_2} \ln^2 n\, \ln \frac{n}{\varepsilon}\big)$ time using $O\big(\frac{n^{1+\varrho}}{\varepsilon\, p_1} \ln^2 n\, \ln \frac{n}{\varepsilon}\big)$
space, where $\varrho = \frac{\ln 1/p_1}{\ln 1/p_2} < 1$, the quantities $p_1 = \Phi(1)$ and $p_2 = \Phi(1+\varepsilon)$ being derived from some
$s$-stable distribution $D$ according to Eq. (1).

Here again the running time bound ignores the time needed to compute distances and to evaluate
hash functions, which is $O(d)$ per operation (distance computation or hash function evaluation) in
$\mathbb{R}^d$. From now on we will also ignore poly-logarithmic factors in $\frac{n}{\varepsilon}$ and hide them within $\tilde O$
notations for the sake of simplicity. Thus, the time and space complexities given in Theorem 2.3
become respectively $\tilde O\big(\frac{n^{\varrho}}{p_1 \ln 1/p_2}\big)$ and $\tilde O\big(\frac{n^{1+\varrho}}{p_1}\big)$, while those given in Corollary 2.4 become respectively
$\tilde O\big(\frac{n^{\varrho}}{p_1 \ln 1/p_2}\big)$ and $\tilde O\big(\frac{n^{1+\varrho}}{\varepsilon\, p_1}\big)$.

The challenge now is to choose a value for parameter $w$ that makes $\varrho$ as small as possible. The
best value for $w$ heavily depends on $s$ and $\varepsilon$, and it may be difficult to find for some values of $s, \varepsilon$,
especially when no closed form solution to Eq. (1) is known. Two special cases of practical interest
($s = 1$ and $s = 2$) are analyzed in [11]:

• In the case $s = 1$, one can use the Cauchy distribution (which is 1-stable) to derive a family
of hash functions, and the probability of collision becomes $\Phi(l) = \frac{2 \arctan(w/l)}{\pi} - \frac{1}{\pi (w/l)} \ln\big(1 + (w/l)^2\big)$.
The ratio $\varrho = \frac{\ln 1/p_1}{\ln 1/p_2}$ lies then strictly above $\frac{1}{1+\varepsilon}$, yet larger and larger values of
parameter $w$ make it closer and closer to $\frac{1}{1+\varepsilon}$.

• In the case $s = 2$, one can use the normal distribution $\mathcal{N}(0, 1)$ (which is 2-stable), and the
probability of collision becomes $\Phi(l) = 1 - 2 F_{\mathcal{N}}(-w/l) - \frac{2}{\sqrt{2\pi}\,(w/l)} \big(1 - e^{-w^2/(2 l^2)}\big)$, where $F_{\mathcal{N}}$
stands for the cumulative distribution function of $\mathcal{N}(0, 1)$. The ratio $\varrho = \frac{\ln 1/p_1}{\ln 1/p_2}$ lies then below
$\frac{1}{1+\varepsilon}$ for reasonably small values of parameter $w$.

The results obtained by Datar et al. [11] can be extended to any $s \in [1, 2]$ via low-distortion
embeddings [19]. In the rest of the paper we will follow [11] and use respectively the Cauchy
distribution and the normal distribution in the cases $s = 1$ and $s = 2$. An analysis of the influence
of the choice of parameter $w$ on the quantities $\varrho$, $\frac{1}{p_1}$ and $\frac{1}{\ln 1/p_2}$ will be provided in Section 3.2.
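The two closed-form collision probabilities can be evaluated numerically to see how $\varrho$ reacts to the choice of $w$. The sketch below assumes the standard expressions given by Datar et al. for the Cauchy and Gaussian families (with distances already rescaled so that $r = 1$); the function names and the sample values of $w$ are illustrative.

```python
import math

def collision_cauchy(l, w):
    """Phi(l) for s = 1 (Cauchy-based family)."""
    t = w / l
    return 2.0 * math.atan(t) / math.pi - math.log(1.0 + t * t) / (math.pi * t)

def collision_gauss(l, w):
    """Phi(l) for s = 2 (Gaussian-based family)."""
    t = w / l
    cdf = 0.5 * (1.0 + math.erf(-t / math.sqrt(2.0)))  # F_N(-w/l)
    tail = 2.0 / (math.sqrt(2.0 * math.pi) * t) * (1.0 - math.exp(-t * t / 2.0))
    return 1.0 - 2.0 * cdf - tail

def rho(phi, eps, w):
    """rho = ln(1/p1) / ln(1/p2) with p1 = Phi(1) and p2 = Phi(1 + eps)."""
    p1, p2 = phi(1.0, w), phi(1.0 + eps, w)
    return math.log(1.0 / p1) / math.log(1.0 / p2)

eps = 1.0
for w in (1.0, 2.0, 4.0, 8.0):
    print(w, round(rho(collision_cauchy, eps, w), 4), round(rho(collision_gauss, eps, w), 4))
```

Comparing the printed ratios with $\frac{1}{1+\varepsilon} = 0.5$ gives a quick empirical view of the two regimes described in the bullet points above.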



2 Our complexity bounds differ from the ones of Har-Peled et al. [15] in that the $\ln\ln n$ factor in their bounds is
replaced by a $\ln n$ factor in ours. This difference comes from the fact that we run the LSH procedure $\omega \ln n$ times, for
a fixed constant $\omega$, to make its output correct with probability at least $1 - \frac{1}{n^{\omega}}$, so that the full $\varepsilon$-NN algorithm is
itself correct with high probability, which will be useful in the rest of the paper. By contrast, the analysis in [15]
only runs the LSH procedure $O(\ln\ln n)$ times, to make the $\varepsilon$-NN algorithm correct with constant probability.

3 Exhaustive $r$-PLEB

Let $(X, \mathrm{d})$ be a metric space and $P$ a finite subset of $X$. The following variant of $r$-PLEB, where
all the $r$-balls containing the query point are asked to be retrieved, will play a central part in the
rest of the paper:

Problem 6 (Exhaustive $r$-PLEB). Given a query point $q \in X$, the exhaustive $r$-PLEB query asks
to return the set $B_P(q, r)$.

This problem is introduced under the name near-neighbors reporting in previous literature [29,
Chapter 1], where a variant of the LSH scheme of Section 2.3 is proposed for solving it. The
difference with the original LSH scheme is that the query procedure does not stop when $3L$ collisions
with the query point $q$ have been found, but instead it continues until all the points colliding with
$q$ in the $L$ hash tables have been collected. The output is then the subset of these points that lie
within $B_P(q, r)$. The details of the pre-processing and query phases are given in Algorithms 1 and 2
respectively, where the data structure is called $\mathcal{A}(P, r, \varepsilon)$. Note that parameter $\varepsilon$ no longer controls
the quality of the output, which is shown to coincide with the set $B_P(q, r)$ with high probability,
but instead it influences the average complexity of the procedure, as we will see later on.
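The pre-processing and query phases just described can be sketched compactly. The code below is an illustration under assumptions that go beyond the text: it fixes the Gaussian ($s = 2$) projection hashes in Euclidean space, takes $p_1$, $p_2$ and $w$ as given inputs, uses an arbitrary default value for the constant $c$, and does not exclude the query point from the output. Function names are hypothetical.

```python
import math
import random

def key(g, x, r, w):
    """Hash key g(x); coordinates are rescaled by 1/r so that the query radius becomes 1."""
    return tuple(math.floor((sum(a_i * x_i / r for a_i, x_i in zip(a, x)) + b) / w)
                 for a, b in g)

def build(points, r, p1, p2, w, c=3.0, seed=0):
    """Pre-processing: ceil(c ln n) independent groups of L tables keyed by k-dimensional vectors."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    k = math.ceil(math.log(n) / math.log(1.0 / p2))
    L = math.ceil(n ** rho / p1)
    tables = []
    for _ in range(math.ceil(c * math.log(n)) * L):
        # One hash vector g = (f_1, ..., f_k); each f is a Gaussian projection hash (a, b).
        g = [([rng.gauss(0.0, 1.0) for _ in range(d)], w * rng.random()) for _ in range(k)]
        table = {}
        for idx, p in enumerate(points):
            table.setdefault(key(g, p, r, w), []).append(idx)
        tables.append((g, table))
    return tables

def query(tables, points, q, r, w):
    """Online phase: scan every bucket hit by q and keep the indices of points within distance r."""
    found = set()
    for g, table in tables:
        for idx in table.get(key(g, q, r, w), []):
            if math.dist(points[idx], q) <= r:
                found.add(idx)
    return found

pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (10.0, 10.0)]
tabs = build(pts, r=1.0, p1=0.8, p2=0.6, w=4.0)
print(sorted(query(tabs, pts, (0.0, 0.0), r=1.0, w=4.0)))
```

Because of the final distance test, the returned set is always a subset of the true ball; what is only guaranteed with high probability is that no point of the ball is missed.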



Input : metric space $(X, \mathrm{d})$, finite set $P$ with $n$ points in $X$, parameters $r, \varepsilon > 0$
Output: $\mathcal{A}(P, r, \varepsilon)$ data structure

1  Take an $(r, r(1+\varepsilon), p_1, p_2)$-sensitive LSH family $\mathcal{F}$;
2  Let $k = \left\lceil \frac{\ln n}{\ln 1/p_2} \right\rceil$ and $L = \left\lceil \frac{n^{\varrho}}{p_1} \right\rceil$, where $\varrho = \frac{\ln 1/p_1}{\ln 1/p_2}$;
3  Create the $k$-dimensional hash family $\mathcal{G}$ as described in Section 2.3;
4  for $i = 1$ to $\lceil c \ln n \rceil$  // $c$ is a constant to be specified later
5  do
6      Pick $L$ functions $\{g_1, \cdots, g_L\}$ independently at random from $\mathcal{G}$;
7      Create the corresponding hash tables $\{H_1, \cdots, H_L\}$;
8      forall the $p \in P$ do
9          for $j = 1$ to $L$ do
10             Insert $p$ into $H_j$ using the key $g_j(p)$;
11         end
12     end
13     Store the data structure $\mathcal{A}_i(P, r, \varepsilon) := \{g_1, \cdots, g_L\} \cup \{H_1, \cdots, H_L\}$;
14 end
15 Output $\mathcal{A}(P, r, \varepsilon) := \bigcup_i \mathcal{A}_i(P, r, \varepsilon)$;

Algorithm 1: Pre-processing phase for exhaustive $r$-PLEB



Input : metric space $(X, \mathrm{d})$, $\mathcal{A}(P, r, \varepsilon)$ data structure, query point $q \in X$

1  Let $k$, $L$, $\varrho$ and $c$ be defined as in Algorithm 1;
2  Initialize the output set: $S := \emptyset$;
3  for $i = 1$ to $\lceil c \ln n \rceil$ do
4      Let $\{g_1, \cdots, g_L\}$ be the functions and $\{H_1, \cdots, H_L\}$ the tables contained in $\mathcal{A}_i(P, r, \varepsilon)$;
5      for $j = 1$ to $L$ do
6          Compute $g_j(q)$ and retrieve the set $C_j$ of the points colliding with $q$ in $H_j$;
7          forall the $p \in C_j$ do
8              if $\mathrm{d}(q, p) \le r$ then
9                  Update the output set: $S := S \cup \{p\}$;
10             end
11         end
12     end
13 end
14 Return $S$;

Algorithm 2: Online query phase for exhaustive $r$-PLEB

In Section 3.1 we revisit the analysis of [29, Chapters 1 and 3] and quantify more precisely the
amount of collisions with the query point that may occur within the hash tables. To this end we
use the following refined concept of locality-sensitive family of hash functions:³

Definition 3.1. Given a metric space $(X, \mathrm{d})$ and positive radii $r_0 < r_1 < r_2$, a family $\mathcal{F} =
\{f : X \to \mathbb{Z}\}$ of hash functions is called $(r_0, r_1, r_2, p_0, p_1, p_2)$-sensitive if there exist quantities
$1 > p_0 > p_1 > p_2 > 0$ such that $\forall x, y \in X$,

(i) $\mathrm{d}(x,y) \le r_1 \Rightarrow \Pr\{f(x) = f(y)\} \ge p_1$,

(ii) $\mathrm{d}(x,y) > r_2 \Rightarrow \Pr\{f(x) = f(y)\} \le p_2$,

(iii) $\mathrm{d}(x,y) > r_0 \Rightarrow \Pr\{f(x) = f(y)\} \le p_0$,

where probabilities are given for a random choice of hash function $f \in \mathcal{F}$ according to some
probability distribution over the family.

3 An even finer concept, proposed in [29, § 3.3], makes the probability of having $f(x) = f(y)$ a function of the
distance between $x$ and $y$. However, for our purpose it is not necessary to go to this level of refinement.

Axioms (i) and (ii) correspond to the classical notion of locality-sensitive family of hash functions
(Definition 2.2). They do not make it possible to limit the number of collisions between the query
point $q$ and the points of $B_P(q, r_1)$ in the analysis of exhaustive $r_1$-PLEB queries. Specifically,
every point of $B_P(q, r_1)$ might collide with $q$ in every hash table in theory, thus raising the cost of
an exhaustive $r_1$-PLEB query to $\Omega(n^{\varrho})$ per point of $B_P(q, r_1)$. This worst case is mostly theoretical, since
in practice the hash functions are likely to make a difference between those points of $B_P(q, r_1)$ that
are really close to $q$ and those that are farther away. This is the reason for introducing the third
axiom (iii), which will prove its usefulness in Section 3.2 where we concentrate on the case where
the ambient space is $(\mathbb{R}^d, \ell_s)$, $s \in (0, 2]$, and show that a non-isometric embedding of the data into
$(\mathbb{R}^{d+1}, \ell_s)$ enables us to move the sets of data and query points away from each other.

3.1 Revisiting the analysis in the general case 

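Theorem 3.2. Given a finite set $P \subseteq X$ with $n$ points and two parameters $r, \varepsilon > 0$, if $(X, \mathrm{d})$ admits
a $(r_1, r_2, p_1, p_2)$-sensitive family $\mathcal{F}$ of hash functions with $r_1 = r$ and $r_2 \le r(1+\varepsilon)$, then Algorithm 2
answers exhaustive $r$-PLEB queries correctly with high probability in expected
$\tilde O\big(\frac{n^{\varrho}}{p_1} \big(\frac{1}{\ln 1/p_2} + 1 + |B_P(q, r(1+\varepsilon))|\big)\big)$ time, involving $\tilde O\big(\frac{n^{\varrho}}{p_1} (1 + |B_P(q, r(1+\varepsilon))|)\big)$ distance computations and
$\tilde O\big(\frac{n^{\varrho}}{p_1 \ln 1/p_2}\big)$ hash function evaluations only, and using $\tilde O\big(\frac{n^{1+\varrho}}{p_1}\big)$ space, where $\varrho = \frac{\ln 1/p_1}{\ln 1/p_2}$. If moreover the family
$\mathcal{F}$ is $(r_0, r_1, r_2, p_0, p_1, p_2)$-sensitive for some $r_0 < r_1$, then for any query point $q \in X$ the algorithm
answers the exhaustive $r$-PLEB query in expected
$\tilde O\big(\frac{n^{\varrho}}{p_1} \big(\frac{1}{\ln 1/p_2} + 1 + |B_P(q, r_0)|\big) + \frac{n^{\alpha}}{p_1} |B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)|\big)$ time,
where $\alpha = \varrho \big(1 - \frac{\ln 1/p_0}{\ln 1/p_1}\big) < \varrho$.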

The first term $\big(\frac{n^{\varrho}}{p_1 \ln 1/p_2}\big)$ in the running time bound corresponds to the complexity of a standard
$(r,\varepsilon)$-PLEB query and can be viewed as the incompressible time needed to locate the query point
$q$ in the data structure. The second term $\big(\frac{n^{\varrho}}{p_1}\big)$ bounds the total number of collisions of $q$ with data
points lying outside $B(q, r(1+\varepsilon))$. The third term $\big(\frac{n^{\varrho}}{p_1} |B_P(q, r_0)|\big)$ arises from the fact that a data
point lying within distance $r_0$ of $q$ may collide in every single hash table with $q$. Finally, the last
term $\big(\frac{n^{\alpha}}{p_1} |B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)|\big)$ arises from the fact that the points of $B_P(q, r(1+\varepsilon))$ that lie
farther than $r_0$ can only collide up to $\frac{n^{\alpha}}{p_1}$ times with $q$ each, for some $\alpha < \varrho$. Note that the less
sensitive the family $\mathcal{F}$ between radii $r_0$ and $r_1$, the closer to 1 the ratio $\frac{\ln 1/p_0}{\ln 1/p_1}$, and therefore the
smaller $\alpha$ compared to $\varrho$. By contrast, the more sensitive the family between radii $r_1$ and $r_2$, the
smaller the ratio $\varrho = \frac{\ln 1/p_1}{\ln 1/p_2}$ compared to 1.
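The relation between $\alpha$ and $\varrho$ can be checked numerically. The sketch below assumes $\varrho = \ln(1/p_1)/\ln(1/p_2)$ and $\alpha = \varrho\,(1 - \ln(1/p_0)/\ln(1/p_1))$, and verifies that the bound $L \cdot p_0^{k}$ on the expected number of tables in which a point farther than $r_0$ collides with $q$ equals $n^{\alpha}/p_1$ once the ceilings on $k$ and $L$ are ignored. The function names and sample probabilities are illustrative.

```python
import math

def exponents(p0, p1, p2):
    """Return (rho, alpha) for a (r0, r1, r2, p0, p1, p2)-sensitive family."""
    if not (0.0 < p2 < p1 < p0 < 1.0):
        raise ValueError("need 0 < p2 < p1 < p0 < 1")
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    alpha = rho * (1.0 - math.log(1.0 / p0) / math.log(1.0 / p1))
    return rho, alpha

def far_point_collision_bound(n, p0, p1, p2):
    """L * p0**k with k = ln n / ln(1/p2) and L = n**rho / p1 (ceilings ignored)."""
    rho, _ = exponents(p0, p1, p2)
    k = math.log(n) / math.log(1.0 / p2)
    L = n ** rho / p1
    return L * p0 ** k

n, p0, p1, p2 = 10**6, 0.9, 0.8, 0.6
rho, alpha = exponents(p0, p1, p2)
print(round(rho, 4), round(alpha, 4))
print(far_point_collision_bound(n, p0, p1, p2), n ** alpha / p1)
```

The closer $p_0$ is to $p_1$, the smaller $\alpha$ becomes, which matches the remark on the sensitivity of the family between radii $r_0$ and $r_1$.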



Our proof of Theorem 3.2 follows previous literature [15] and is divided into three parts: (1)
proving the correctness of the output of Algorithm 2 with high probability, (2) bounding the
expected query time, and (3) bounding the size of the data structure. The novelty resides in
Lemma 3.6, which exploits the axiom (iii) of Definition 3.1 to bound the number of collisions of $q$
with points of $B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)$.

Correctness of the output. Note that the test on line 8 of Algorithm 2 ensures that the output
set $S$ is always a subset of $B_P(q, r)$. Thus, we only need to show that $S$ contains all the points of
$B_P(q, r)$ with high probability at the end of the query.

Lemma 3.3. $B_P(q, r) \subseteq S$ with probability at least $1 - n^{1 - c \ln \frac{5}{2}}$.

This result means that the probability of success of the query is high, even for small values of
$c$. For instance, it is at least $1 - \frac{1}{n}$ for $c \ge \frac{2}{\ln 5/2}$, and more generally it is at least $1 - \frac{1}{n^{\omega}}$ for $c \ge \frac{1+\omega}{\ln 5/2}$.

Proof of the lemma. Let $p$ be a point of $B_P(q, r)$. Consider a single iteration $i$ of the main loop of Algorithm 2, and let us show that $p$ is inserted in the output set $S$ during this iteration with constant probability. This is equivalent to showing that, with constant probability, there exists some function $g_j(\cdot)$ that hashes $q$ and $p$ to the same location ($g_j(q) = g_j(p)$). Since $d(q, p) \le r$, the probability of a collision for a fixed $j$ is at least

$$p_1^k = p_1^{\lceil \ln n / \ln(1/p_2) \rceil} = e^{-\ln(1/p_1)\,\lceil \ln n / \ln(1/p_2) \rceil} \ge e^{-\ln(1/p_1)\left(\frac{\ln n}{\ln 1/p_2} + 1\right)} = p_1\, n^{-\frac{\ln 1/p_1}{\ln 1/p_2}} = p_1\, n^{-\varrho}.$$

Therefore, the probability that no hash function $g_j$ generates a collision is at most $(1 - p_1 n^{-\varrho})^L \le (1 - p_1 n^{-\varrho})^{n^{\varrho}/p_1}$, since $L = \lceil n^{\varrho}/p_1 \rceil$ functions are picked from $\mathcal{G}$ at iteration $i$. Thus, the probability that this iteration inserts $p$ into the output set $S$ is at least $1 - (1 - p_1 n^{-\varrho})^{n^{\varrho}/p_1} \ge 1 - \frac{1}{e} > \frac{1}{2}$.

Now, there are $\lceil c \ln n \rceil$ iterations in total, with independent hash functions, so the probability that $p \notin S$ at the end of the query is at most $\left(\frac{1}{2}\right)^{\lceil c \ln n \rceil} = e^{\ln\frac{1}{2}\,\lceil c \ln n \rceil} \le n^{c \ln \frac{1}{2}}$. Applying the union bound on the set $B_P(q, r)$, we obtain that the probability that all points of $B_P(q, r)$ belong to $S$ at the end of the query is at least $1 - |B_P(q, r)|\, n^{c \ln \frac{1}{2}} \ge 1 - n^{1 + c \ln \frac{1}{2}} = 1 - n^{1 - c \ln 2}$. □
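The quantities appearing in this proof can be evaluated numerically. The following is a minimal sketch, not the authors' implementation: the function names and the sample values of $n$, $p_1$, $p_2$ are assumptions chosen for illustration. It computes $k$, $L$ and the number of iterations, then evaluates the per-iteration lower bound $1 - (1 - p_1^k)^L$ and the union-bound failure probability $n^{1 - c \ln 2}$.

```python
import math

def lsh_parameters(n, p1, p2, c):
    """Return (k, L, iterations, rho) as they appear in the proof of Lemma 3.3."""
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))  # functions concatenated in each g_j
    L = math.ceil(n ** rho / p1)                   # hash tables per iteration
    iterations = math.ceil(c * math.log(n))        # iterations of the main loop
    return k, L, iterations, rho

def per_iteration_success(n, p1, p2):
    """Lower bound 1 - (1 - p1^k)^L on the chance that one iteration catches p."""
    k, L, _, _ = lsh_parameters(n, p1, p2, c=1.0)
    return 1 - (1 - p1 ** k) ** L

def failure_upper_bound(n, c):
    """Union-bound failure probability n^(1 - c ln 2) from Lemma 3.3."""
    return n ** (1 - c * math.log(2))

# Illustrative values only (assumed, not taken from the paper).
n, p1, p2 = 10_000, 0.8, 0.5
k, L, iterations, rho = lsh_parameters(n, p1, p2, c=2 / math.log(2))
print(k, L, iterations, round(rho, 3))
print(per_iteration_success(n, p1, p2))
print(failure_upper_bound(n, 2 / math.log(2)))
```

With $c = 2/\ln 2$ the failure bound evaluates to roughly $1/n$, in line with the remark following the lemma.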



Remark 3.4. It is easily seen from the final paragraph of the proof of Lemma 3.3 that the correctness of the output can be guaranteed with probability $1 - m^{1 - c \ln 2}$ for any given $m \ge n$. Indeed, by running $\lceil c \ln m \rceil$ iterations of the main loops of Algorithms 1 and 2 instead of $\lceil c \ln n \rceil$ iterations, we obtain that each point of $B_P(q, r)$ belongs to $S$ at the end of the query with probability at least $1 - m^{-c \ln 2}$, and thus that $B_P(q, r) \subseteq S$ with probability at least $1 - m^{1 - c \ln 2}$. This remark will be useful when dealing with $\mathcal{RNN}$ queries in Section 5.
Expected query time. First of all, the query point $q$ is hashed into $\lceil n^{\varrho}/p_1 \rceil \lceil c \ln n \rceil = \tilde{O}\left(\frac{n^{\varrho}}{p_1}\right)$ hash tables in total, and each hashing operation involves $k = \left\lceil \frac{\ln n}{\ln 1/p_2} \right\rceil = O\left(\frac{\ln n}{\ln 1/p_2}\right)$ hash function evaluations, $c$ being a constant here. Thus, the total number of hash function evaluations is $\tilde{O}\left(\frac{n^{\varrho}}{p_1}\,\frac{\ln n}{\ln 1/p_2}\right)$, and so is the total time spent hashing $q$ (modulo the time needed to do a hash function evaluation, which is ignored here as in the previous sections). It remains to bound the expected number of collisions of $q$ with points of $P$ in the hash tables.

Lemma 3.5. The expected total number of collisions of $q$ with points of $P \setminus B(q, r(1+\varepsilon))$ is $\tilde{O}\left(\frac{n^{\varrho}}{p_1}\right)$.

Proof. Take an arbitrary iteration $i$ of the main loop of Algorithm 2 and an arbitrary hash table $H_j$ considered during that iteration. Recall that the hash family $\mathcal{G}$ is constructed in Algorithm 1 by concatenating $k = \left\lceil \frac{\ln n}{\ln 1/p_2} \right\rceil$ functions drawn from a $(r, (1+\varepsilon)r, p_1, p_2)$-sensitive family $\mathcal{F}$. Therefore, the probability that a given point of $P \setminus B(q, r(1+\varepsilon))$ collides with $q$ in $H_j$ is at most

$$p_2^k = p_2^{\lceil \ln n / \ln(1/p_2) \rceil} = e^{-\ln(1/p_2)\,\lceil \ln n / \ln(1/p_2) \rceil} \le e^{-\ln(1/p_2)\,\frac{\ln n}{\ln 1/p_2}} = \frac{1}{n}.$$

It follows that the expected number of points of $P \setminus B(q, r(1+\varepsilon))$ that collide with $q$ in $H_j$ is at most 1, from which we conclude that the expected total number of such collisions in all the hash tables at all iterations is at most $\lceil n^{\varrho}/p_1 \rceil \lceil c \ln n \rceil = \tilde{O}\left(\frac{n^{\varrho}}{p_1}\right)$. □

Without any further assumptions on the family $\mathcal{F}$ of hash functions, each point of $B_P(q, r(1+\varepsilon))$ might collide with $q$ in every hash table. The number of collisions of $q$ with points of $B_P(q, r(1+\varepsilon))$ is therefore $O\left(\left\lceil \frac{n^{\varrho}}{p_1} \right\rceil c \ln n\, |B_P(q, r(1+\varepsilon))|\right) = \tilde{O}\left(\frac{n^{\varrho}}{p_1} |B_P(q, r(1+\varepsilon))|\right)$. Combined with Lemma 3.5, this bound implies that the expected running time of the algorithm is $\tilde{O}\left(\frac{n^{\varrho}}{p_1}\left(\frac{\ln n}{\ln 1/p_2} + 1 + |B_P(q, r(1+\varepsilon))|\right)\right)$, as claimed in the theorem. For every collision considered, a test is made on the distance between $q$ and the colliding point of $P$ (see line 8 of Algorithm 2). With simple bookkeeping, e.g. by marking the points of $P$ that have already been considered during the query, we can afford to do the test at most once per point of $P$, thus yielding a total number of distance computations of the order of $\tilde{O}\left(\frac{n^{\varrho}}{p_1} + |B_P(q, r(1+\varepsilon))|\right)$.

Consider now the stronger hypothesis that the family $\mathcal{F}$ of hash functions is $(r_0, r, r(1+\varepsilon), p_0, p_1, p_2)$-sensitive for some $r_0 \le r$.

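Lemma 3.6. Assuming that $\mathcal{F}$ is $(r_0, r, r(1+\varepsilon), p_0, p_1, p_2)$-sensitive, the expected total number of collisions of $q$ with points of $B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)$ is $\tilde{O}\left(\frac{n^{\alpha}}{p_1} |B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)|\right)$, where $\alpha = \varrho\left(1 - \frac{\ln p_0}{\ln p_1}\right)$.

Proof. Take an arbitrary iteration $i$ of the main loop of Algorithm 2 and an arbitrary hash table $H_j$ considered during that iteration. The probability that a given point $p \in B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)$ collides with $q$ in $H_j$ is at most

$$p_0^k = p_0^{\lceil \ln n / \ln(1/p_2) \rceil} = e^{-\ln(1/p_0)\,\lceil \ln n / \ln(1/p_2) \rceil} \le e^{-\ln(1/p_0)\,\frac{\ln n}{\ln 1/p_2}} = n^{-\frac{\ln p_0}{\ln p_2}}.$$

It follows that the expected total number of collisions between $p$ and $q$ during the execution of the algorithm is at most $n^{-\frac{\ln p_0}{\ln p_2}} \left\lceil \frac{n^{\varrho}}{p_1} \right\rceil \lceil c \ln n \rceil = \tilde{O}\left(\frac{n^{\alpha}}{p_1}\right)$, where

$$\alpha = \varrho - \frac{\ln p_0}{\ln p_2} = \frac{\ln p_1}{\ln p_2} - \frac{\ln p_0}{\ln p_2} = \frac{\ln p_1}{\ln p_2}\left(1 - \frac{\ln p_0}{\ln p_1}\right) = \varrho\left(1 - \frac{\ln p_0}{\ln p_1}\right).$$

We conclude that the expected total number of collisions of $q$ with points of $B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)$ during the course of the algorithm is $\tilde{O}\left(\frac{n^{\alpha}}{p_1} |B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)|\right)$. □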



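It follows from Lemma 3.6 that the expected query time becomes $\tilde{O}\left(\frac{n^{\varrho}}{p_1}\left(\frac{\ln n}{\ln 1/p_2} + 1 + |B_P(q, r_0)|\right) + \frac{n^{\alpha}}{p_1} |B_P(q, r(1+\varepsilon)) \setminus B(q, r_0)|\right)$ when the family $\mathcal{F}$ of hash functions is $(r_0, r, r(1+\varepsilon), p_0, p_1, p_2)$-sensitive, as claimed in the theorem.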
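To see how the exponents $\varrho$ and $\alpha$ relate in practice, here is a small sketch under assumed collision probabilities (the values of $p_0$, $p_1$, $p_2$ below are illustrative and do not come from the paper). It checks numerically that the two expressions for $\alpha$ used in the proof of Lemma 3.6 agree, and that $\alpha$ vanishes when $p_0 = p_1$.

```python
import math

def rho(p1, p2):
    """Exponent rho = ln p1 / ln p2."""
    return math.log(p1) / math.log(p2)

def alpha(p0, p1, p2):
    """Exponent alpha = rho * (1 - ln p0 / ln p1)."""
    return rho(p1, p2) * (1 - math.log(p0) / math.log(p1))

def alpha_alternative(p0, p1, p2):
    """Same exponent written as rho - ln p0 / ln p2."""
    return rho(p1, p2) - math.log(p0) / math.log(p2)

# Illustrative probabilities with p0 >= p1 >= p2 (assumed values).
p0, p1, p2 = 0.9, 0.8, 0.5
print(rho(p1, p2), alpha(p0, p1, p2), alpha_alternative(p0, p1, p2))
```

The closer $p_0$ is to $p_1$, the smaller $\alpha$ becomes relative to $\varrho$, which is the regime where the output-sensitive term of the bound is cheapest.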
Size of the data structure. Each hash table contains one pointer per point of $P$, and there are $\lceil n^{\varrho}/p_1 \rceil \lceil c \ln n \rceil$ such hash tables in total, so we need to store $\tilde{O}\left(\frac{n^{1+\varrho}}{p_1}\right)$ pointers in total. In addition, we need to store the $\lceil n^{\varrho}/p_1 \rceil \lceil c \ln n \rceil$ vectors of hash functions corresponding to the hash tables, but this term is dominated by the previous one. Thus, in total our data structure has a space complexity of $\tilde{O}\left(\frac{n^{1+\varrho}}{p_1}\right)$. This bound ignores the costs of storing the input point cloud and the selected hash functions, which depend on the type of data representation.



3.2 Affine case: the non-isometric embedding trick

Assume from now on that the ambient space is $(\mathbb{R}^d, \ell_s)$, where $s \in (0, 2]$, and note that axiom (iii) of Definition 3.1 is satisfied by the families of hash functions introduced in Section 2.4 since the probability $\Phi(l)$ defined in Eq. (?) decreases as the distance $l$ increases. In order to prevent the points of $P$ from getting too close to the query point $q$, so that axiom (iii) can be exploited, our strategy is to apply a non-isometric embedding into $(\mathbb{R}^{d+1}, \ell_s)$ that moves $q$ away from $P$, while preserving the order of the distances to $q$.

At preprocessing time, we lift the points of $P$ to $(\mathbb{R}^{d+1}, \ell_s)$ by adding one coordinate equal to $0$ to every point. We then build an $\mathcal{A}(P', r', \varepsilon')$ data structure using Algorithm 1, where $P'$ denotes the image of $P$ through the embedding, $r' = r\left(1 + \frac{1}{(1+\varepsilon)^s - 1}\right)^{1/s}$ and $\varepsilon' = \left((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1\right)^{1/s} - 1$. In effect, right before building the data structure we follow Section 2.4 and rescale $P'$ by a factor of $1/r'$, to get a normalized point cloud $P''$ on top of which we build an $\mathcal{A}(P'', 1, \varepsilon')$ data structure using Algorithm 1.

At query time, we lift $q$ to $\mathbb{R}^{d+1}$ by adding one coordinate equal to $\frac{r}{((1+\varepsilon)^s - 1)^{1/s}}$, then we answer an exhaustive $r'$-$\mathcal{PLEB}$ query in $\mathbb{R}^{d+1}$ by running Algorithm 2 with the $\mathcal{A}(P', r', \varepsilon')$ data structure, and then we return the pre-image of the output set through the embedding. Once again, in effect we rescale the image of the query point in $\mathbb{R}^{d+1}$ by a factor of $1/r'$, so Algorithm 2 is actually run with $\mathcal{A}(P'', 1, \varepsilon')$.

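Note that the embedding into $\mathbb{R}^{d+1}$ is not isometric since it does not preserve the distances of $q$ to the data points. However, it does preserve their order. Indeed, for every point $p \in P$ the distance $d(p, q)$ becomes $\left(d(p, q)^s + \frac{r^s}{(1+\varepsilon)^s - 1}\right)^{1/s}$ after the embedding. Since the map $t \mapsto \left(t^s + \frac{r^s}{(1+\varepsilon)^s - 1}\right)^{1/s}$ is monotonically increasing with $t$, the embedding preserves the order of distances to $q$. We then have the following easy properties, where $x' \in \mathbb{R}^{d+1}$ denotes the image of any point $x \in P \cup \{q\}$ through the embedding:

(i) $\forall p \in P$, $d(p, q) \le r \Longleftrightarrow d(p', q') \le \left(r^s + \frac{r^s}{(1+\varepsilon)^s - 1}\right)^{1/s} = r\left(1 + \frac{1}{(1+\varepsilon)^s - 1}\right)^{1/s} = r'$;

(ii) $\forall p \in P$, $d(p', q') \ge \left(\frac{r^s}{(1+\varepsilon)^s - 1}\right)^{1/s} = \frac{r}{((1+\varepsilon)^s - 1)^{1/s}} = \frac{r'}{1+\varepsilon}$;

(iii) $\forall p \in P$, $d(p', q') \le r'(1+\varepsilon') \Longrightarrow$

$$\begin{aligned}
d(p, q) &\le \left(r'^s (1+\varepsilon')^s - \frac{r^s}{(1+\varepsilon)^s - 1}\right)^{1/s} \\
&= \left(r^s\left(1 + \frac{1}{(1+\varepsilon)^s - 1}\right)\left((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1\right) - \frac{r^s}{(1+\varepsilon)^s - 1}\right)^{1/s} \\
&= \left(\frac{r^s (1+\varepsilon)^s}{(1+\varepsilon)^s - 1}\left((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1\right) - \frac{r^s}{(1+\varepsilon)^s - 1}\right)^{1/s} \\
&= r\left(\frac{(1+\varepsilon)^s\left((1+\varepsilon)^s - 1\right)}{(1+\varepsilon)^s - 1}\right)^{1/s} \\
&= r(1+\varepsilon).
\end{aligned}$$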
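Properties (i) to (iii) can be checked numerically. The sketch below is an illustration only: the helper names are hypothetical, and it assumes the lifting described above, in which each data point receives an extra coordinate $0$ and the query receives an extra coordinate $h = r / ((1+\varepsilon)^s - 1)^{1/s}$.

```python
import math

def lift_parameters(r, eps, s):
    """Return (h, r_prime, eps_prime) for the lifting into (R^{d+1}, l_s)."""
    t = (1 + eps) ** s - 1
    h = r / t ** (1 / s)                  # extra coordinate given to the query
    r_prime = r * (1 + 1 / t) ** (1 / s)  # radius after the lifting
    eps_prime = ((1 + eps) ** s + (1 + eps) ** (-s) - 1) ** (1 / s) - 1
    return h, r_prime, eps_prime

def lifted_distance(dist, h, s):
    """l_s distance between p' = (p, 0) and q' = (q, h), given d(p, q) = dist."""
    return (dist ** s + h ** s) ** (1 / s)

for s in (1, 2):
    r, eps = 1.5, 0.4
    h, r_prime, eps_prime = lift_parameters(r, eps, s)
    # (i): points at distance r are mapped to distance r'
    print(math.isclose(lifted_distance(r, h, s), r_prime))
    # (ii): no lifted point is closer to q' than r' / (1 + eps)
    print(math.isclose(lifted_distance(0.0, h, s), r_prime / (1 + eps)))
    # (iii): points at distance r (1 + eps) are mapped to distance r' (1 + eps')
    print(math.isclose(lifted_distance(r * (1 + eps), h, s), r_prime * (1 + eps_prime)))
```

Because `lifted_distance` is increasing in `dist`, the three equalities at the boundary values are enough to recover the inequalities stated in (i) to (iii).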

It follows from (i) that $B_{P'}(q', r')$ is the image of $B_P(q, r)$ through the embedding. Hence, by Lemma 3.3, with high probability the output set of the exhaustive $r'$-$\mathcal{PLEB}$ query in $\mathbb{R}^{d+1}$ is the image of $B_P(q, r)$ through the embedding. Thus, our output is correct with high probability. In the meantime, the embedding has the following impact on the complexity bounds of Theorem 3.2:

• On the negative side, parameter $\varepsilon$ is now replaced by $\varepsilon' = \left((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1\right)^{1/s} - 1 < \varepsilon$, which increases $p_2$ from $\Phi(1+\varepsilon)$ to $\Phi(1+\varepsilon')$. This means that the ratio $\varrho = \frac{\ln p_1}{\ln p_2}$ becomes $\frac{\ln \Phi(1)}{\ln \Phi(1+\varepsilon')} \ge \frac{\ln \Phi(1)}{\ln \Phi(1+\varepsilon)}$, and thus $\varrho$ gets closer to 1, even though it still remains strictly below 1. Furthermore, the term $\frac{1}{\ln 1/p_2}$ grows from $\frac{1}{\ln 1/\Phi(1+\varepsilon)}$ to $\frac{1}{\ln 1/\Phi(1+\varepsilon')}$.

• On the positive side, we know from (ii) that the points of $P'$ lie at least $\frac{r'}{1+\varepsilon}$ away from the query point $q'$, so by Lemma 3.6 they cannot collide with $q'$ more than $\tilde{O}\left(\frac{n^{\alpha}}{p_1}\right)$ times each in expectation, where $\alpha = \varrho\left(1 - \frac{\ln p_0}{\ln p_1}\right)$, $p_1 = \Phi(1)$, and $p_0 = \Phi\left(\frac{1}{1+\varepsilon}\right)$.

For the rest, the embedding is a neutral operation. Indeed, even though the complexity now depends on the size of $B_{P'}(q', r'(1+\varepsilon'))$ instead of the size of $B_P(q, r(1+\varepsilon))$, we know from (iii) that the preimage of the former set through the embedding is contained within the latter set, so we have $|B_{P'}(q', r'(1+\varepsilon'))| \le |B_P(q, r(1+\varepsilon))|$. In addition, the fact that the query now takes place in $\mathbb{R}^{d+1}$ instead of $\mathbb{R}^d$, with a radius parameter that grew from $r$ to $r'$, does not affect the probabilities $p_1, p_2$, which depend neither on the ambient dimension, as pointed out after Eq. (?), nor on the radius thanks to the rescaling of the data. It also does not affect the asymptotic complexities of distance computations and hash function evaluations, which remain $O(d)$.

All in all, we obtain the following complexity bounds for the exhaustive $r$-$\mathcal{PLEB}$ query in $(\mathbb{R}^d, \ell_s)$, where $\mathcal{A}'(P, r, \varepsilon)$ denotes the full data structure built at preprocessing time, which contains the embedding and rescaling information together with the $\mathcal{A}(P'', 1, \varepsilon')$ data structure:

Theorem 3.7. Given a finite set $P$ with $n$ points in $(\mathbb{R}^d, \ell_s)$, $s \in (0, 2]$, and two parameters $r, \varepsilon > 0$, the $\mathcal{A}'(P, r, \varepsilon)$ data structure answers exhaustive $r$-$\mathcal{PLEB}$ queries correctly with high probability in expected $\tilde{O}\left(\frac{n^{\varrho}}{p_1}\left(\frac{\ln n}{\ln 1/p_2} + 1\right) + \frac{n^{\alpha}}{p_1} |B_P(q, r(1+\varepsilon))|\right)$ time using $\tilde{O}\left(\frac{n^{1+\varrho}}{p_1}\right)$ space, where $\varrho = \frac{\ln p_1}{\ln p_2}$ and $\alpha = \varrho\left(1 - \frac{\ln p_0}{\ln p_1}\right) \le \varrho$, the quantities $p_0 = \Phi\left(\frac{1}{1+\varepsilon}\right)$, $p_1 = \Phi(1)$ and $p_2 = \Phi\left(\left((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1\right)^{1/s}\right)$ being derived from some $s$-stable distribution $D$ according to Eq. (?).

Quantifying precisely the amounts by which the quantities $\varrho$, $\alpha$ and $\frac{1}{\ln 1/p_2}$ are affected by the embedding, what the corresponding best choice of parameter $w$ is, and how this choice impacts $\frac{1}{p_1}$, are the main questions at this point. Because Eq. (?) may not always have a closed form solution, it is difficult to provide an answer in full generality for all values $s \in (0, 2]$. We will nevertheless investigate two special cases that are of practical interest: $s = 1$ and $s = 2$.

Case $s = 1$. The definition of $\varepsilon'$ gives $\varepsilon' = \frac{\varepsilon^2}{1+\varepsilon}$ in this case. The formula for $\varrho$ is then the same as in $\mathbb{R}^d$, with $\varepsilon$ replaced by $\frac{\varepsilon^2}{1+\varepsilon}$. As reported in [11] and illustrated in Figure 1 (left), $\varrho$ remains above $\frac{1}{1+\varepsilon^2/(1+\varepsilon)}$, even though it seems to converge to this quantity as $w$ tends to infinity. Letting $w = \max\{1, \varepsilon\}$, we found experimentally that $\varrho$ is dominated by $\frac{1}{1+\varepsilon^2/4}$ when $\varepsilon \le 1$ and by $\frac{1}{1+\varepsilon/4}$ when $\varepsilon \ge 1$, as shown in Figures 1 (right) and 2 (left). In the meantime, $\alpha$ is less than $\varepsilon\varrho$, as can be seen from Figure 2 (right), while $\frac{1}{p_1}$ and $\frac{1}{\ln 1/\Phi(1+\varepsilon')}$ are less than 4 and 1 respectively, as shown in Figure 3. All in all, Theorem 3.7 can be re-written as follows:

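Theorem 3.7 (case $s = 1$). Given a finite set $P$ with $n$ points in $(\mathbb{R}^d, \ell_1)$, and two parameters $r, \varepsilon > 0$, the $\mathcal{A}'(P, r, \varepsilon)$ data structure answers exhaustive $r$-$\mathcal{PLEB}$ queries correctly with high probability in expected $\tilde{O}(n^{\varrho} + n^{\alpha} |B_P(q, r(1+\varepsilon))|)$ time using $\tilde{O}(n^{1+\varrho})$ space, where $\varrho \le \frac{1}{1+\min\{\varepsilon^2, \varepsilon\}/4} < 1$ and $\alpha \le \varepsilon\varrho < \varepsilon$.

Figure 1: Behavior of $\varrho$ in $(\mathbb{R}^{d+1}, \ell_1)$. Left: plots of $\varrho$ (blue) and $\frac{1}{1+\varepsilon'} = \frac{1}{1+\varepsilon^2/(1+\varepsilon)}$ (red) versus $\varepsilon$ and $w$. Right: plots of $\varrho$ (blue) and $\frac{1}{1+\min\{\varepsilon^2, \varepsilon\}/4}$ (red) versus $\varepsilon$ and $w$.

Figure 2: Behaviors of $\varrho$ and $\alpha$ in $(\mathbb{R}^{d+1}, \ell_1)$ after letting $w = \max\{1, \varepsilon\}$. From left to right, in blue: plots of $\varrho(1 + \min\{\varepsilon^2, \varepsilon\}/4)$ and $\frac{\alpha}{\varepsilon\varrho}$. Both plots are versus $\varepsilon$ on a logarithmic scale ($x = \log_{10} \varepsilon$). The red lines have equation $y = 1$.

Figure 3: Behaviors of $\frac{1}{p_1}$ and $\frac{1}{\ln 1/\Phi(1+\varepsilon')}$ in $(\mathbb{R}^{d+1}, \ell_1)$ after letting $w = \max\{1, \varepsilon\}$. From left to right, in blue: plots of $\frac{1}{p_1}$ and $\frac{1}{\ln 1/\Phi(1+\varepsilon')}$. Both plots are versus $\varepsilon$ on a logarithmic scale ($x = \log_{10} \varepsilon$). The red lines have equation $y = 1$.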
Figure 4: Behavior of $\varrho$ in $(\mathbb{R}^{d+1}, \ell_2)$. Left: plots of $\varrho$ (blue) and $\frac{1}{1+\varepsilon'} = \frac{1}{\sqrt{(1+\varepsilon)^2 + (1+\varepsilon)^{-2} - 1}}$ (red) versus $\varepsilon$ and $w$. Right: plots of $\varrho$ (blue) and $\frac{1}{1+\varepsilon^2/(1+\varepsilon)}$ (red) versus $\varepsilon$ and $w$.

Case $s = 2$. The definition of $\varepsilon'$ gives $\varepsilon' = \sqrt{(1+\varepsilon)^2 + (1+\varepsilon)^{-2} - 1} - 1$ in this case. The formula for $\varrho$ is then the same as in $\mathbb{R}^d$, with $\varepsilon$ replaced by $\sqrt{(1+\varepsilon)^2 + (1+\varepsilon)^{-2} - 1} - 1$. As pointed out in [?] and illustrated in Figure 4 (left), $\varrho$ goes below $\frac{1}{1+\varepsilon'} = \frac{1}{\sqrt{(1+\varepsilon)^2 + (1+\varepsilon)^{-2} - 1}}$ at reasonably small values of parameter $w$. Since this bound is not quite evocative, we used a slightly different bound, namely $\frac{1}{1+\varepsilon^2/(1+\varepsilon)}$, and we found experimentally that $\varrho \le \frac{1}{1+\varepsilon^2/(1+\varepsilon)}$ whenever $w = \max\{1, \varepsilon\}$, as shown in Figures 4 (right) and 5 (left). In the meantime, $\alpha$ is less than $\varepsilon\varrho$, as can be seen from Figure 5 (right), while the terms $\frac{1}{p_1}$ and $\frac{1}{\ln 1/\Phi(1+\varepsilon')}$ are bounded by small constants, as shown in Figure 6. All in all, Theorem 3.7 can be re-written as follows:

Theorem 3.7 (case $s = 2$). Given a finite set $P$ with $n$ points in $(\mathbb{R}^d, \ell_2)$, and two parameters $r, \varepsilon > 0$, the $\mathcal{A}'(P, r, \varepsilon)$ data structure answers exhaustive $r$-$\mathcal{PLEB}$ queries correctly with high probability in expected $\tilde{O}(n^{\varrho} + n^{\alpha} |B_P(q, r(1+\varepsilon))|)$ time using $\tilde{O}(n^{1+\varrho})$ space, where $\varrho \le \frac{1}{1+\varepsilon^2/(1+\varepsilon)} < 1$ and $\alpha \le \varepsilon\varrho < \varepsilon$.

Figure 5: Behaviors of $\varrho$ and $\alpha$ in $(\mathbb{R}^{d+1}, \ell_2)$ after letting $w = \max\{1, \varepsilon\}$. From left to right, in blue: plots of $\varrho(1 + \varepsilon^2/(1+\varepsilon))$ and $\frac{\alpha}{\varepsilon\varrho}$. Both plots are versus $\varepsilon$ on a logarithmic scale ($x = \log_{10} \varepsilon$). The red lines have equation $y = 1$.

Figure 6: Behaviors of $\frac{1}{p_1}$ and $\frac{1}{\ln 1/\Phi(1+\varepsilon')}$ in $(\mathbb{R}^{d+1}, \ell_2)$ after letting $w = \max\{1, \varepsilon\}$. From left to right, in blue: plots of $\frac{1}{p_1}$ and $\frac{1}{\ln 1/\Phi(1+\varepsilon')}$. Both plots are versus $\varepsilon$ on a logarithmic scale ($x = \log_{10} \varepsilon$). The red lines have equation $y = 1$.



4 Interlude: from exhaustive $r$-$\mathcal{PLEB}$ to exact $\mathcal{NN}$

Before dealing with $\mathcal{RNN}$ queries (the main topic of the paper), let us show a simple but pedagogical application of exhaustive $r$-$\mathcal{PLEB}$ queries to exact $\mathcal{NN}$ search. Given a set $P$ with $n$ points and a user-defined parameter $\varepsilon > 0$, we will show that $\mathcal{NN}$ queries can be solved exactly with high probability on any query point $q$ in expected $\tilde{O}(n^{\varrho} + n^{\alpha} |O(\varepsilon)\text{-}\mathcal{NN}_P(q)|)$ time using $\tilde{O}(n^{1+\varrho})$ space, for some quantities $\varrho = \frac{1}{1+\Theta(\varepsilon^2)} < 1$ and $\alpha \le \varepsilon\varrho < \varepsilon$ (Theorem 4.3). The running time bound is composed of two terms: the first one is sublinear in $n$ and corresponds to a standard approximate $\varepsilon$-$\mathcal{NN}$ query using locality-sensitive hashing; the second one depends on the size of the approximate nearest neighbors set $O(\varepsilon)\text{-}\mathcal{NN}_P(q)$ and indicates that the solution to the exact query is sought among this set. Whether the bound is sublinear in $n$ or not in the end depends on the size of the set compared to the quantity $n^{1-\alpha}$. This follows the intuition that finding the exact nearest neighbor of $q$ is easy when $q$ does not have too many approximate nearest neighbors, and in this respect the quantity $|O(\varepsilon)\text{-}\mathcal{NN}_P(q)|$ plays the role of a condition number measuring the inherent difficulty of a given instance of the exact $\mathcal{NN}$ problem. The interesting point to raise here is that the limit on this number for our algorithm to be sublinear is at least of the order of $n^{1-\varepsilon}$ since we have $\alpha < \varepsilon$.

Let us point out that the above bounds are for the ambient space $\mathbb{R}^d$ equipped with the $\ell_1$- or $\ell_2$-norm. Our analysis will be carried out in the more general setting of an $\ell_s$-norm, with $s \in (0, 2]$, where we will derive more general complexity bounds. The choice of $(\mathbb{R}^d, \ell_s)$ is mainly for ease of exposition, since the algorithm can actually be applied in arbitrary metric spaces that admit locality-sensitive families of hash functions, where its analysis extends in a straightforward manner (see Remark 4.4 at the end of the section).



The algorithm. Let $P$ be a finite set of $n$ points in $(\mathbb{R}^d, \ell_s)$, $s \in (0, 2]$, and let $\varepsilon > 0$ be a parameter. The preprocessing phase consists of the following steps:

i. Build the tree structure $T(P, \varepsilon)$ of Section 2.2 and its associated $(r, \varepsilon)$-$\mathcal{PLEB}$ data structures.

ii. For every $(r, \varepsilon)$-$\mathcal{PLEB}$ data structure built on some subset of $P$ at step i, build an $\mathcal{A}'(P, r, \varepsilon)$ data structure using the procedure of Section 3.2.

Then, given a query point $q$, we proceed as follows:

1. Answer an $\varepsilon$-$\mathcal{NN}$ query using the tree structure $T(P, \varepsilon)$, and let $r > 0$ be the output value.

2. Answer an exhaustive $r$-$\mathcal{PLEB}$ query using the $\mathcal{A}'(P, r, \varepsilon)$ data structure, and let $S$ be the output set.

3. Iterate over the points of $S$ and return the one that is closest to $q$. If $S$ is empty, then return any arbitrary point of $P$.

Note that the execution of step 2 is made possible by the fact that the algorithm solving the $\varepsilon$-$\mathcal{NN}$ query at step 1 returns a radius $r$ that is stored in one of the $\mathcal{A}'(P, r, \varepsilon)$ data structures built during the preprocessing phase. For any other value $r$ we would not be able to perform step 2 because we would not have the corresponding $\mathcal{A}'(P, r, \varepsilon)$ data structure at hand.
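The three query steps can be sketched as follows. This is only an illustration of the reduction: both oracles are brute-force stand-ins (assumptions of this sketch) rather than the LSH-based structures $T(P, \varepsilon)$ and $\mathcal{A}'(P, r, \varepsilon)$, the function names are hypothetical, and the Euclidean distance is used for concreteness.

```python
import math
import random

def approx_nn_radius(P, q, eps, rng):
    """Stand-in for step 1: some r with d(q, P) <= r <= (1 + eps) d(q, P)."""
    d = min(math.dist(p, q) for p in P)
    return d * (1 + rng.random() * eps)

def exhaustive_ball_query(P, q, r):
    """Stand-in for step 2: every point of P within distance r of q."""
    return [p for p in P if math.dist(p, q) <= r]

def exact_nn(P, q, eps, rng):
    r = approx_nn_radius(P, q, eps, rng)      # step 1
    S = exhaustive_ball_query(P, q, r)        # step 2
    if not S:                                 # step 3, degenerate case
        return P[0]
    return min(S, key=lambda p: math.dist(p, q))  # step 3

rng = random.Random(0)
P = [(rng.random(), rng.random()) for _ in range(200)]
q = (0.5, 0.5)
print(exact_nn(P, q, 0.3, rng))
```

Since the radius returned at step 1 is at least $d(q, P)$, the ball queried at step 2 always contains the true nearest neighbor, so step 3 only has to scan the approximate nearest neighbors of $q$.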



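Analysis. We begin by showing the correctness of the query procedure:

Lemma 4.1. The query procedure returns a point of $\mathcal{NN}_P(q)$ with high probability.

Proof. Corollary 2.4 guarantees that the radius $r$ computed at step 1 satisfies $d(q, P) \le r \le d(q, P)(1+\varepsilon)$ with high probability. Under this condition, we have $\mathcal{NN}_P(q) \subseteq B_P(q, r)$, and so Theorem 3.7 guarantees that the set $S$ computed at step 2 contains $\mathcal{NN}_P(q)$ with high probability. It follows that the point returned at step 3 belongs to $\mathcal{NN}_P(q)$ with high probability. □

We will now analyze the expected running time of the query. Let $D$ be the $s$-stable distribution used by the algorithm, and let $p_0 = \Phi\left(\frac{1}{1+\varepsilon}\right)$, $p_1 = \Phi(1)$, $p_2 = \Phi(1+\varepsilon)$ and $p_2' = \Phi\left(\left((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1\right)^{1/s}\right)$ be derived from $D$ according to Eq. (?). By Corollary 2.4, the running time of step 1 is $\tilde{O}\left(\frac{n^{\varrho}}{p_1}\right)$, where $\varrho = \frac{\ln p_1}{\ln p_2}$. The running time of step 3 is $O(|S|)$, so it is dominated by the running time of step 2.

Lemma 4.2. The expected running time of step 2 is $\tilde{O}\left(\frac{n^{\varrho'}}{p_1}\left(\frac{\ln n}{\ln 1/p_2'} + 1\right) + \frac{n^{\alpha}}{p_1} |\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)|\right)$, where $\varrho' = \frac{\ln p_1}{\ln p_2'}$ and $\alpha = \varrho'\left(1 - \frac{\ln p_0}{\ln p_1}\right)$.

Proof. Let $r$ be the radius computed at step 1. By Theorem 3.7, the expected running time of step 2 is $\tilde{O}\left(\frac{n^{\varrho'}}{p_1}\left(\frac{\ln n}{\ln 1/p_2'} + 1\right) + \frac{n^{\alpha}}{p_1} |B_P(q, r(1+\varepsilon))|\right)$. If $r \le d(q, P)(1+\varepsilon)$, then $r(1+\varepsilon) \le d(q, P)(1+\varepsilon)^2 = d(q, P)(1 + \varepsilon(2+\varepsilon))$, so we have $B_P(q, r(1+\varepsilon)) \subseteq \varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)$ and the expected running time becomes $\tilde{O}\left(\frac{n^{\varrho'}}{p_1}\left(\frac{\ln n}{\ln 1/p_2'} + 1\right) + \frac{n^{\alpha}}{p_1} |\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)|\right)$. By contrast, if $r > d(q, P)(1+\varepsilon)$, then we have no bound on the size of $B_P(q, r(1+\varepsilon))$ other than $n$, so the expected running time of step 2 becomes $\tilde{O}\left(\frac{n^{\varrho'}}{p_1}\left(\frac{\ln n}{\ln 1/p_2'} + 1\right) + \frac{n^{\alpha+1}}{p_1}\right)$. Now, recall from Section 2 that the event that $r > d(q, P)(1+\varepsilon)$ only occurs with very low probability, more precisely with probability at most $\frac{1}{n}$. Therefore, in total the expected running time of step 2 is bounded by

$$\tilde{O}\left(\frac{n^{\varrho'}}{p_1}\left(\frac{\ln n}{\ln 1/p_2'} + 1\right) + \frac{n^{\alpha}}{p_1} |\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)| + \frac{1}{n}\,\frac{n^{\alpha+1}}{p_1}\right) = \tilde{O}\left(\frac{n^{\varrho'}}{p_1}\left(\frac{\ln n}{\ln 1/p_2'} + 1\right) + \frac{n^{\alpha}}{p_1} |\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)|\right),$$

since the set $\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)$ contains at least one point, namely the nearest neighbor of $q$. □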



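Let us now focus on the size of the data structure. By Corollary 2.4, the total size of the tree $T(P, \varepsilon)$ and associated $(r, \varepsilon)$-$\mathcal{PLEB}$ data structures is $\tilde{O}\left(\frac{n^{1+\varrho}}{\varepsilon p_1}\right)$. In addition, since $T(P, \varepsilon)$ has $O(n)$ nodes in total, each one storing $O\left(\frac{1}{\varepsilon}\right)$ data structures for $(r, \varepsilon)$-$\mathcal{PLEB}$, the total number of $\mathcal{A}'(P, r, \varepsilon)$ data structures built at step ii of the preprocessing phase is $O\left(\frac{n}{\varepsilon}\right)$. Therefore, by Theorem 3.7 the total memory usage of the $\mathcal{A}'(P, r, \varepsilon)$ data structures is $\tilde{O}\left(\frac{n^{2+\varrho'}}{\varepsilon p_1}\right)$.

Observing now that we have $p_2' \ge p_2$ and $\varrho' \ge \varrho$ since $\left((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1\right)^{1/s} \le 1 + \varepsilon$, we conclude that our procedure has the following space and time complexities (where $p_2'$ and $\varrho'$ have been renamed respectively $p_2$ and $\varrho$ for convenience):

Theorem 4.3. Given a finite set $P$ with $n$ points in $(\mathbb{R}^d, \ell_s)$, $s \in (0, 2]$, and a user-defined parameter $\varepsilon > 0$, our procedure answers exact $\mathcal{NN}$ queries with high probability in expected $\tilde{O}\left(\frac{n^{\varrho}}{p_1}\left(\frac{\ln n}{\ln 1/p_2} + 1\right) + \frac{n^{\alpha}}{p_1} |\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)|\right)$ time using $\tilde{O}\left(\frac{n^{2+\varrho}}{\varepsilon p_1}\right)$ space, where $\varrho = \frac{\ln p_1}{\ln p_2}$ and $\alpha = \varrho\left(1 - \frac{\ln p_0}{\ln p_1}\right)$, the quantities $p_0 = \Phi\left(\frac{1}{1+\varepsilon}\right)$, $p_1 = \Phi(1)$ and $p_2 = \Phi\left(\left((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1\right)^{1/s}\right)$ being derived from some $s$-stable distribution $D$ according to Eq. (?).

Replacing Theorem 3.7 by its specialized versions for $s = 1$ and $s = 2$ in the analysis immediately gives the following complexity bounds:

Theorem 4.3 (case $s = 1$). Given a finite set $P$ with $n$ points in $(\mathbb{R}^d, \ell_1)$, and a user-defined parameter $\varepsilon > 0$, our procedure answers exact $\mathcal{NN}$ queries with high probability in expected $\tilde{O}(n^{\varrho} + n^{\alpha} |\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)|)$ time using $\tilde{O}\left(\frac{1}{\varepsilon} n^{2+\varrho}\right)$ space, where $\varrho \le \frac{1}{1+\min\{\varepsilon^2, \varepsilon\}/4} < 1$ and $\alpha \le \varepsilon\varrho < \varepsilon$.

Theorem 4.3 (case $s = 2$). Given a finite set $P$ with $n$ points in $(\mathbb{R}^d, \ell_2)$, and a user-defined parameter $\varepsilon > 0$, our procedure answers exact $\mathcal{NN}$ queries with high probability in expected $\tilde{O}(n^{\varrho} + n^{\alpha} |\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)|)$ time using $\tilde{O}\left(\frac{1}{\varepsilon} n^{2+\varrho}\right)$ space, where $\varrho \le \frac{1}{1+\varepsilon^2/(1+\varepsilon)} < 1$ and $\alpha \le \varepsilon\varrho < \varepsilon$.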

Note that in practice a trade-off must be made by the user when choosing parameter $\varepsilon$. Indeed, the smaller $\varepsilon$, the smaller the set $\varepsilon(2+\varepsilon)\text{-}\mathcal{NN}_P(q)$ and the smaller $\alpha$ compared to $\varrho$, but on the other hand the higher $\varrho$ itself.

Remark 4.4. In our analysis we traded optimality for simplicity since we applied the results from Section 3.2 verbatim. In fact, a closer look at the problem reveals that the points of $P$ lie at least $d(q, P) \ge \frac{r}{1+\varepsilon}$ away from the query point $q$ with high probability at step 2 of the query phase. This means that no lifting of the data into $\mathbb{R}^{d+1}$ is actually needed. We then have $p_2' = p_2$, $\varrho' = \varrho$, and a careful analysis shows that relevant choices of parameter $w$ reduce $\varrho$ down to (or at least close to) $\frac{1}{1+\varepsilon}$. In addition and more importantly, not having to re-embed the data means that the algorithm can be applied in arbitrary metric spaces $(X, d)$ that admit locality-sensitive families of hash functions, where the analysis extends in a straightforward manner.



5 From exhaustive $r$-$\mathcal{PLEB}$ to exact $\mathcal{RNN}$

In this section we focus on our main problem ($\mathcal{RNN}$) and show how it can be reduced to a single instance of $\varepsilon$-$\mathcal{NN}$ search plus a controlled number of instances of exhaustive $r$-$\mathcal{PLEB}$. Although the reduction is applicable in any metric space, we will restrict our study to the case of $\mathbb{R}^d$ equipped with an $\ell_s$-norm, $s \in (0, 2]$, where the non-isometric embedding trick of Section 3.2 can be used to speed up the process. The details of the reduction are given in Section 5.2, its output proven correct in Section 5.3, and its complexity analyzed in Section 5.4. The reduction and analysis are then extended to the bichromatic setting in Section 5.5. For now we begin with an overview of the reduction and of its key ingredients in Section 5.1.



5.1 Overview of the reduction 

Let $P$ be a finite set with $n$ points in $(\mathbb{R}^d, \ell_s)$, $s \in (0, 2]$. Suppose the distance of every point $p \in P$ to its nearest neighbor in $P \setminus \{p\}$ has been pre-computed. Then, given a query point $q$, computing a solution to the $\mathcal{RNN}$ query amounts to checking, for every point $p \in P$, whether $d(q, p) \le d(p, P)$ or $d(q, p) > d(p, P)$: in the first case, $p$ must be included in the solution, whereas in the second case it must not. This check for point $p$ can be done by computing the solution $S$ of the exhaustive $r$-$\mathcal{PLEB}$ query on input $(P, q)$, with $r = d(p, P)$, and by including $p$ in the answer if and only if it belongs to $S$. Indeed,

$$p \in \mathcal{RNN}_P(q) \Longleftrightarrow d(p, q) \le d(p, P) = r \Longleftrightarrow p \in B_P(q, r) \Longleftrightarrow p \in S.$$

Thus, computing the set $\mathcal{RNN}_P(q)$ boils down to locating $q$ among the set of balls $\{B(p, d(p, P)) \mid p \in P\}$. This observation was exploited in previous work [23] and serves as the starting point of our approach. The main problem is that the ball radius $r$ changes with each data point $p \in P$ considered, so the total number of exhaustive $r$-$\mathcal{PLEB}$ queries to be solved can be up to linear in $n$. To reduce this number, we allow some degree of fuzziness and use a bucketing strategy. Given a user-defined parameter $\varepsilon > 0$, at pre-processing time we compute and store $d(p, P)$ for every point $p \in P$ and then we hash the data points into buckets according to their nearest neighbor distances, so that bucket $P_i$ contains the points $p \in P$ such that $(1+\varepsilon)^{i-1} \le d(p, P) < (1+\varepsilon)^i$. At query time, we solve an exhaustive $r$-$\mathcal{PLEB}$ query with $r = (1+\varepsilon)^i$ on each bucket $P_i$ separately, then we consider the union $S$ of the solutions and prune out those points $p \in S$ such that $d(p, q) > d(p, P)$. Since the points $p \in P_i$ satisfy $(1+\varepsilon)^{i-1} \le d(p, P) < (1+\varepsilon)^i$, it is easily seen that $\mathcal{RNN}_P(q) \subseteq S \subseteq \varepsilon\text{-}\mathcal{RNN}_P(q)$ and that our output is an admissible solution to the $\mathcal{RNN}$ query.
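The bucketing strategy can be illustrated with a short sketch. It is not the authors' code: the exhaustive $r$-$\mathcal{PLEB}$ query is replaced by a brute-force ball query, the helper names are hypothetical, the points are assumed to be pairwise distinct (so that $d(p, P) > 0$), and the buckets are assumed to follow the half-open convention $(1+\varepsilon)^{i-1} \le d(p, P) < (1+\varepsilon)^i$.

```python
import math
import random
from collections import defaultdict

def nn_distances(P):
    """d(p, P): distance from each point to its nearest neighbor in P minus itself."""
    return [min(math.dist(p, o) for j, o in enumerate(P) if j != i)
            for i, p in enumerate(P)]

def bucket_index(d, eps):
    """Integer i with (1+eps)^(i-1) <= d < (1+eps)^i, corrected for rounding."""
    i = math.floor(math.log(d, 1 + eps)) + 1
    while (1 + eps) ** i <= d:
        i += 1
    while (1 + eps) ** (i - 1) > d:
        i -= 1
    return i

def build_buckets(P, eps):
    nn = nn_distances(P)
    buckets = defaultdict(list)
    for idx, d in enumerate(nn):
        buckets[bucket_index(d, eps)].append(idx)
    return nn, buckets

def candidates(P, buckets, q, eps):
    """Union S of per-bucket ball queries of radius (1+eps)^i (brute-force stand-in)."""
    S = set()
    for i, members in buckets.items():
        r = (1 + eps) ** i
        S.update(idx for idx in members if math.dist(P[idx], q) <= r)
    return S

def rnn_by_bucketing(P, nn, buckets, q, eps):
    """Prune the candidates that are not reverse nearest neighbors of q."""
    return {idx for idx in candidates(P, buckets, q, eps)
            if math.dist(P[idx], q) <= nn[idx]}

rng = random.Random(0)
P = [(rng.random(), rng.random()) for _ in range(100)]
nn, buckets = build_buckets(P, 0.5)
print(sorted(rnn_by_bucketing(P, nn, buckets, (0.5, 0.5), 0.5)))
```

Before pruning, every candidate $p$ satisfies $d(p, q) \le (1+\varepsilon)\, d(p, P)$, which is the sandwich $\mathcal{RNN}_P(q) \subseteq S \subseteq \varepsilon\text{-}\mathcal{RNN}_P(q)$ stated above. This version still visits every non-empty bucket.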

A remaining issue is that we do not impose any constraints on parameter $i$, so at query time we need to inspect every single non-empty bucket $P_i$. As a result, in pathological cases such as when all non-empty buckets are singletons, we will end up considering a linear number of buckets, even though the set $\varepsilon\text{-}\mathcal{RNN}_P(q)$ itself might be small or even empty. To avoid this pitfall, we limit the range of values of $i$ to be considered thanks to the following observations, where $y$ is an arbitrary point of $\varepsilon\text{-}\mathcal{NN}_P(q)$:

Observation 1. Every point $p \in \mathcal{RNN}_P(q)$ satisfies $d(p, P) \ge \frac{d(q, y)}{1+\varepsilon}$.

Proof. Since $p \in \mathcal{RNN}_P(q)$, we have $p \ne q$ and $d(p, q) \le d(p, P)$. Moreover, since $p \ne q$ and $y \in \varepsilon\text{-}\mathcal{NN}_P(q)$, we have $d(q, y) \le (1+\varepsilon)\, d(q, P) \le (1+\varepsilon)\, d(q, p)$. It follows that $d(q, y) \le (1+\varepsilon)\, d(p, P)$. □

Observation 2. Every point $p \in \mathcal{RNN}_P(q)$ such that $d(p, P) \ge \frac{d(q, y)}{\varepsilon}$ belongs to $\varepsilon\text{-}\mathcal{RNN}_P(y) \cup \{y\}$.

Proof. Since $p \in \mathcal{RNN}_P(q)$, we have $d(p, q) \le d(p, P)$. In addition, we have $d(q, y) \le \varepsilon\, d(p, P)$ by hypothesis. Hence, $d(p, y) \le d(p, q) + d(q, y) \le (1+\varepsilon)\, d(p, P)$, which means that either $p = y$ or $p \in \varepsilon\text{-}\mathcal{RNN}_P(y)$. □

Assuming that we have precomputed a data structure that enables us to find some $y \in \varepsilon\text{-}\mathcal{NN}_P(q)$, Observation 1 ensures that we can safely ignore the buckets $P_i$ with $i \le \log_{1+\varepsilon} \frac{d(q, y)}{1+\varepsilon}$. Furthermore, assuming that the set $\varepsilon\text{-}\mathcal{RNN}_P(y)$ has been precomputed, Observation 2 ensures that the reverse nearest neighbors of $q$ that belong to the buckets $P_i$ with $i \ge 1 + \log_{1+\varepsilon} \frac{d(q, y)}{\varepsilon}$ can simply be looked for among the points of $\varepsilon\text{-}\mathcal{RNN}_P(y) \cup \{y\}$. Thus, the total number of buckets to be inspected is reduced to $O\left(\frac{1}{\varepsilon} \log \frac{1}{\varepsilon}\right) = \tilde{O}\left(\frac{1}{\varepsilon}\right)$.
5.2 Details of the reduction

Given a finite set $P$ with $n$ points in $(\mathbb{R}^d, \ell_s)$, $s \in (0, 2]$, and a parameter $\varepsilon > 0$, our pre-computation phase builds a data structure $\mathcal{RNNDS}(P, \varepsilon)$ that stores the following pieces of information:

i. A collection of buckets $\{P_i\}_{i \in \mathbb{Z}}$ that partition $P$. Each bucket $P_i$ contains those points $p \in P$ such that $(1+\varepsilon)^{i-1} \le d(p, P) < (1+\varepsilon)^i$. To fill in the buckets, we iterate over the points $p \in P$, we compute the distance $d(p, P)$ exactly$^4$ and store it, and then we assign $p$ to its corresponding bucket. Once this is done, the empty buckets are discarded and the non-empty buckets are stored in a hash table to ensure constant look-up time. On each non-empty bucket $P_i$ we build an $\mathcal{A}'(P_i, (1+\varepsilon)^i, \varepsilon)$ data structure using the procedure of Section 3.2. Note that when applying Algorithm 1 we increase the number of iterations of the main loop from $\lceil c \ln |P_i| \rceil$ to $\lceil c \ln n \rceil$, where $c = \frac{3}{\ln 2}$.

ii. For each point $y \in P$, an array $P_y$ containing the points $p \in \varepsilon\text{-}\mathcal{RNN}_P(y) \cup \{y\}$, sorted by increasing distances $d(p, P)$. Building the array takes $\tilde{O}(n)$ time once $d(p, P)$ has been computed for all $p \in P$.

iii. The tree $T(P, \varepsilon)$ of Section 2.2 and its associated $(r, \varepsilon)$-$\mathcal{PLEB}$ data structures.

Given a point $q \in \mathbb{R}^d$, we answer the $\mathcal{RNN}$ query using the $\mathcal{RNNDS}(P, \varepsilon)$ data structure as follows:

1. We use the tree $T(P, \varepsilon)$ and its associated $(r, \varepsilon)$-$\mathcal{PLEB}$ data structures to answer an $\varepsilon$-$\mathcal{NN}$ query, and we let $y$ be the output point.

2. We use the $\mathcal{A}'(P_i, (1+\varepsilon)^i, \varepsilon)$ data structure to answer an exhaustive $(1+\varepsilon)^i$-$\mathcal{PLEB}$ query on each bucket $P_i$ separately, for $i$ lying in the range prescribed by Observations 1 and 2, and then we merge the output sets into a single set $S$. Note that when applying Algorithm 2 on $P_i$ we increase the number of iterations of the main loop from $\lceil c \ln |P_i| \rceil$ to $\lceil c \ln n \rceil$, where $c = \frac{3}{\ln 2}$, which raises the probability of success of the query from $1 - \frac{1}{|P_i|^2}$ (which can be as low as $0$ when $P_i$ is a singleton) to $1 - \frac{1}{n^2}$.

3. We add to $S$ the points $p \in \varepsilon\text{-}\mathcal{RNN}_P(y) \cup \{y\}$ such that $d(p, P) \ge \frac{d(q, y)}{\varepsilon}$. These are found by looking up the value $\frac{d(q, y)}{\varepsilon}$ in the sorted array $P_y$ by binary search, and then by iterating until the end of the array.

4. We iterate over the points $p \in S$ and remove the ones that do not satisfy $d(p, q) \le d(p, P)$.

Upon termination, we return the set $S$. The pseudo-codes of the preprocessing and query procedures are given in Algorithms 3 and 4.
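The preprocessing and query steps above can be sketched end to end as follows. This is a hedged illustration rather than the paper's data structure: the $\varepsilon$-$\mathcal{NN}$ and exhaustive $\mathcal{PLEB}$ oracles are brute-force stand-ins, all function names are hypothetical, and the sketch assumes pairwise distinct points, a query $q \notin P$, the half-open buckets $(1+\varepsilon)^{i-1} \le d(p, P) < (1+\varepsilon)^i$, and $\varepsilon\text{-}\mathcal{RNN}_P(y) = \{p \ne y : d(p, y) \le (1+\varepsilon)\, d(p, P)\}$.

```python
import bisect
import math
import random
from collections import defaultdict

def build_rnnds(P, eps):
    """Buckets, nearest-neighbor distances and sorted arrays P_y (brute force)."""
    n = len(P)
    nn = [min(math.dist(P[a], P[b]) for b in range(n) if b != a) for a in range(n)]
    buckets = defaultdict(list)
    for a in range(n):
        i = math.floor(math.log(nn[a], 1 + eps)) + 1
        while (1 + eps) ** i <= nn[a]:        # guard against rounding
            i += 1
        while (1 + eps) ** (i - 1) > nn[a]:
            i -= 1
        buckets[i].append(a)
    arrays = []
    for y in range(n):
        members = [a for a in range(n)
                   if a == y or math.dist(P[a], P[y]) <= (1 + eps) * nn[a]]
        members.sort(key=lambda a: nn[a])
        arrays.append(([nn[a] for a in members], members))  # (sorted keys, points)
    return nn, dict(buckets), arrays

def rnn_query(P, ds, q, eps):
    nn, buckets, arrays = ds
    # Step 1: the exact nearest neighbor is a valid eps-approximate one.
    y = min(range(len(P)), key=lambda a: math.dist(P[a], q))
    dqy = math.dist(P[y], q)
    # Step 2: ball queries on the buckets in the range given by Observations 1 and 2.
    lo = math.floor(math.log(dqy / (1 + eps), 1 + eps)) + 1
    hi = math.ceil(math.log(dqy / eps, 1 + eps))
    S = set()
    for i in range(lo, hi + 1):
        r = (1 + eps) ** i
        S.update(a for a in buckets.get(i, ()) if math.dist(P[a], q) <= r)
    # Step 3: binary search in P_y for the points with d(p, P) >= d(q, y) / eps.
    keys, members = arrays[y]
    S.update(members[bisect.bisect_left(keys, dqy / eps):])
    # Step 4: prune the points that are not reverse nearest neighbors of q.
    return {a for a in S if math.dist(P[a], q) <= nn[a]}

rng = random.Random(0)
P = [(rng.random(), rng.random()) for _ in range(100)]
ds = build_rnnds(P, 0.5)
print(sorted(rnn_query(P, ds, (0.5, 0.5), 0.5)))
```

Only the buckets with index between `lo` and `hi` are visited, so the number of ball queries no longer depends on the number of non-empty buckets.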

5.3 Correctness of the output

Corollary 2.4 guarantees that step 1 of the query procedure retrieves a point $y \in \varepsilon\text{-}\mathcal{NN}_P(q)$ with high probability. Let us show that, given that $y \in \varepsilon\text{-}\mathcal{NN}_P(q)$, the final set $S$ output by the query procedure satisfies $S = \mathcal{RNN}_P(q)$ with high probability. For clarity, we let $S'$ be the set of points inserted in $S$ at step 2 of the procedure, and $S''$ be the set of points inserted at step 3. The output of the algorithm is then $(S' \cup S'') \cap \mathcal{RNN}_P(q)$. Let $P' = \bigcup_i P_i$ for $i = \left\lfloor \log_{1+\varepsilon} \frac{d(q, y)}{1+\varepsilon} \right\rfloor + 1$ to $\left\lceil \log_{1+\varepsilon} \frac{d(q, y)}{\varepsilon} \right\rceil$.

Lemma 5.1. $\mathcal{RNN}_P(q) \cap P' \subseteq S'$ with high probability.

$^4$ This can be done either by brute-force or using the algorithm of Section 4.
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Input : point cloud P C parameter e > 
Output: TZNNVS(P,e) data structure 

1 Initialize Pj := for i£Z; 

2 foreach p£Pdo 

3 Compute d(p, P) exactly and store it; 

4 Find i s.t. (1 + e)*" 1 < d(p, P) < (1 + e)* and update P := P U {p}; 

5 end 

6 foreach Pi 7^ do 

7 J Build an A' (Pi, (1 + e) 1 , e) data structure; 

8 end 

9 foreach y G P do 



Build the set e-lZNN p(y) U {y} and store it in an array P y ; 
Sort the points p G P y by increasing distances d(p, P) ; 



10 
11 

12 end 

13 Build the tree T(P, e) of Section 2.2 and its associated (r, e)-VC£B data structures ; 



Algorithm 3: Pre-processing phase for IZAfAf. 

Input : $\mathcal{RNNDS}(P, \varepsilon)$ data structure, query point $q \in \mathbb{R}^d$

Answer an $\varepsilon$-$\mathcal{NN}$ query on input $(P, q)$, and let $y$ be the output;
for $i = \lfloor \log_{1+\varepsilon} \frac{d(q,y)}{1+\varepsilon} \rfloor + 1$ to $\lceil \log_{1+\varepsilon} \frac{d(q,y)}{\varepsilon} \rceil$ do
    if $P_i \ne \emptyset$ then
        Answer an exhaustive $(1+\varepsilon)^i$-$\mathcal{PLEB}$ query on input $(P_i, q)$, and let $S_i$ be the output;
    end
end
Let $S := \bigcup_i S_i$;
Look up the value $\frac{d(q,y)}{\varepsilon}$ in the sorted array $P_y$ by binary search;
Iterate from the value $\frac{d(q,y)}{\varepsilon}$ to the end of the array $P_y$ and insert all the visited points into $S$;
foreach $p \in S$ do
    if $d(p, q) > d(p, P)$ then
        Remove $p$ from $S$;
    end
end
Return $S$;

Algorithm 4: Online query phase for $\mathcal{RNN}$.
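The four steps of the query procedure can be checked against a brute-force answer with the sketch below. It deliberately replaces the LSH-based components by exact stand-ins (an exact nearest neighbour instead of the $\varepsilon$-$\mathcal{NN}$ query, a linear scan of each bucket instead of the exhaustive $\mathcal{PLEB}$ query), so it illustrates the logic of the reduction and not its running time. All function names are hypothetical, the points are assumed pairwise distinct, and the query is assumed not to coincide with a data point.

```python
import bisect
import math
import random

def nn_dist(p, pts):
    """d(p, P): distance from p to its nearest neighbour in P, p itself excluded."""
    return min(math.dist(p, x) for x in pts if x != p)

def bucket_index(dist, eps):
    """Integer i such that (1+eps)**(i-1) <= dist < (1+eps)**i."""
    i = math.floor(math.log(dist) / math.log1p(eps)) + 1
    while (1 + eps) ** (i - 1) > dist:   # repair floating-point rounding
        i -= 1
    while (1 + eps) ** i <= dist:
        i += 1
    return i

def preprocess(pts, eps):
    d_p = {p: nn_dist(p, pts) for p in pts}
    buckets = {}
    for p in pts:
        buckets.setdefault(bucket_index(d_p[p], eps), []).append(p)
    # P_y = eps-RNN_P(y) U {y}, sorted by increasing d(p, P).
    arrays = {}
    for y in pts:
        members = [p for p in pts if p == y or math.dist(p, y) <= (1 + eps) * d_p[p]]
        arrays[y] = sorted(members, key=lambda p: d_p[p])
    return d_p, buckets, arrays

def rnn_query(q, pts, eps, index):
    d_p, buckets, arrays = index
    # Step 1: exact nearest neighbour, used here in place of an eps-NN query.
    y = min(pts, key=lambda p: math.dist(p, q))
    dqy = math.dist(q, y)
    # Step 2: ball queries on the intermediate buckets. The range is widened by
    # one index on each side to absorb floating-point error; step 4 filters anyway.
    i_lo = math.floor(math.log(dqy / (1 + eps)) / math.log1p(eps))
    i_hi = math.ceil(math.log(dqy / eps) / math.log1p(eps)) + 1
    found = set()
    for i in range(i_lo, i_hi + 1):
        radius = (1 + eps) ** i
        found.update(p for p in buckets.get(i, []) if math.dist(p, q) <= radius)
    # Step 3: binary search for d(q, y)/eps in P_y, then scan to the end.
    keys = [d_p[p] for p in arrays[y]]
    found.update(arrays[y][bisect.bisect_left(keys, dqy / eps):])
    # Step 4: keep only the true reverse nearest neighbours.
    return {p for p in found if math.dist(p, q) <= d_p[p]}

def rnn_brute(q, pts):
    """Reference answer by exhaustive search."""
    return {p for p in pts if math.dist(p, q) <= nn_dist(p, pts)}

random.seed(0)
P = [(random.random(), random.random()) for _ in range(200)]
index = preprocess(P, eps=0.5)
q = (0.5, 0.5)
print(rnn_query(q, P, 0.5, index) == rnn_brute(q, P))
```

Because the stand-ins are exact, the comparison with `rnn_brute` exercises only the bucket range of step 2 and the threshold of step 3, which is where the correctness argument of Section 5.3 applies.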



Proof. Step 2 of the query procedure builds $S'$ by taking the union of the sets $S_i$ generated by answering exhaustive $(1+\varepsilon)^i$-$\mathcal{PLEB}$ queries on the non-empty buckets $P_i$ with query point $q$. For each such $P_i$, we have $\mathcal{RNN}_P(q) \cap P_i \subseteq B_{P_i}(q, (1+\varepsilon)^i)$ since by definition every point $p \in \mathcal{RNN}_P(q) \cap P_i$ satisfies $d(p, q) \le d(p, P) < (1+\varepsilon)^i$. Now, by Theorem 3.2, we have $S_i = B_{P_i}(q, (1+\varepsilon)^i)$ with the success probability guaranteed for bucket $P_i$ at preprocessing time, so $\mathcal{RNN}_P(q) \cap P_i \subseteq S_i$ with at least the same probability. Since the total number of non-empty buckets is at most $n$, the union bound tells us that $\mathcal{RNN}_P(q) \cap P' \subseteq S'$ with high probability. □

Lemma 5.2. Given that $y \in \varepsilon\text{-}\mathcal{NN}_P(q)$, we have $\mathcal{RNN}_P(q) \setminus P' \subseteq S''$ with high probability.



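Proof. The result follows from Observations 1 and 2. Indeed, every point $p \in P_i$ with $i < \lfloor \log_{1+\varepsilon} \frac{d(q,y)}{1+\varepsilon} \rfloor + 1$ satisfies $d(p, P) < (1+\varepsilon)^i \le \frac{d(q,y)}{1+\varepsilon}$ and therefore cannot belong to $\mathcal{RNN}_P(q)$, by Observation 1. In addition, the points $p \in \mathcal{RNN}_P(q) \cap P_i$ with $i > \lceil \log_{1+\varepsilon} \frac{d(q,y)}{\varepsilon} \rceil$ satisfy $d(p, P) \ge (1+\varepsilon)^{i-1} \ge \frac{d(q,y)}{\varepsilon}$ and therefore belong to $\varepsilon\text{-}\mathcal{RNN}_P(y) \cup \{y\}$, by Observation 2. Hence, all such points $p$ are inserted in $S$ at step 3 of the query procedure. It follows that $\mathcal{RNN}_P(q) \setminus P' \subseteq S''$. □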



It follows from Lemmas 5.1 and 5.2 that $(S' \cup S'') \cap \mathcal{RNN}_P(q) = \mathcal{RNN}_P(q)$ with high probability. In other words, the set $S$ returned after step 4 of the query procedure coincides with $\mathcal{RNN}_P(q)$ with high probability.
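Observations 1 and 2 are not restated in this section. Assuming they take the natural form suggested by the proof of Lemma 5.2, the inequalities behind them can be spelled out as follows, for $y \in \varepsilon\text{-}\mathcal{NN}_P(q)$ and $p \in \mathcal{RNN}_P(q)$ (a sketch relying only on the definitions and on the triangle inequality):

```latex
\begin{align*}
d(p, P) \;&\ge\; d(p, q) \;\ge\; d(q, P) \;\ge\; \frac{d(q, y)}{1+\varepsilon},\\[4pt]
d(p, P) \ge \frac{d(q, y)}{\varepsilon}
\;&\Longrightarrow\;
d(p, y) \;\le\; d(p, q) + d(q, y) \;\le\; d(p, P) + \varepsilon\, d(p, P) \;=\; (1+\varepsilon)\, d(p, P).
\end{align*}
```

The first line rules out the buckets below the range visited at step 2, while the second line places the points of the buckets above that range in the array $P_y$ scanned at step 3.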

5.4 Complexity 

Let D be the s-stable distribution used by the algorithm, and let po = $>(jt^), Pi = ^(l), P2 = 
$(1 + e) and p' 2 = $(((1 + e) s + (1 + e)~ s - l) 1/s ) be derived from D according to Eq. (m). By 



Corollary 



2.4 



the running time of the e-MN query at step 1 is 0( — )"y P2 )> where £ = jj^- Then, 



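for $i$ ranging from $\lfloor \log_{1+\varepsilon} \frac{d(q,y)}{1+\varepsilon} \rfloor + 1$ to $\lceil \log_{1+\varepsilon} \frac{d(q,y)}{\varepsilon} \rceil$, the exhaustive $(1+\varepsilon)^i$-$\mathcal{PLEB}$ query on the set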

P takes 0(^(4^ + 1) + ^\B Pi (q, (1 + = ©(^(^ + 1) + f |B Pi (g, (1 + e)^)\) 

time in expectation, where g' = j^f and a = £>'(! — ) 5 by Theorem 



d(g,y) 



3.7 



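Observe that the points $p \in B_{P_i}(q, (1+\varepsilon)^{i+1})$ satisfy $d(p, q) \le (1+\varepsilon)^{i+1} \le (1+\varepsilon)^2\, d(p, P)$, so we have $B_{P_i}(q, (1+\varepsilon)^{i+1}) \subseteq \varepsilon(2+\varepsilon)\text{-}\mathcal{RNN}_P(q)$. Furthermore, since the buckets $P_i$ are pairwise disjoint, so are the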

sets Bp^q, (1 + It follows that the total expected time spent at step 2 is Q(g^~( i n i/ p / + 1) + 

^|e(2 + e)-7?jVAAp(g)|), the factor | in the first term coming from the fact that there are 0{\) 
iterations of the loop. Considering now step 3, the binary search takes $O(\log_2 |P_y|) = O(\log_2 n)$ time. For every point $p \in P_y$ such that $d(p, P) \ge \frac{d(q,y)}{\varepsilon}$, we have $d(p, y) \ge d(p, P) \ge \frac{d(q,y)}{\varepsilon}$, so $d(p, q) \le d(p, y) + d(y, q) \le (1+\varepsilon)\, d(p, y) \le (1+\varepsilon)^2\, d(p, P)$ since $p \in P_y = \varepsilon\text{-}\mathcal{RNN}_P(y)$. It follows that $p \in \varepsilon(2+\varepsilon)\text{-}\mathcal{RNN}_P(q)$. Hence, the total time spent at step 3 is $O(\log_2 n + |\varepsilon(2+\varepsilon)\text{-}\mathcal{RNN}_P(q)|)$ and is therefore dominated by the time spent at step 2. Finally, the time spent at step 4 is dominated by the times spent at steps 2 and 3. Combining these bounds together and using the fact that $p'_2 \ge p_2$ and $\rho' \ge \rho$ since $((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1)^{1/s} \le 1+\varepsilon$, we obtain the following query time bound (where $p'_2$ and $\rho'$ are renamed respectively $p_2$ and $\rho$ for convenience):

Theorem 5.3. Given q £ (R d ,£ s ), the expected query time is 0(Y^{ \ n i/ V2 + 1) + ^~k(2 + 
e)-KNN P (q)\), where g = and a = g(l - |^) ; the quantities Po = Pi = *(1) 

and P2 = ^(((1 + e) s + (1 + e)" s — l) 1 ^) being derived from some s-stable distribution D according 
to Eq. 0. 
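Two quantities used in the analysis are easy to explore numerically: the number of bucket indices visited at step 2, that is, the integers between $\lfloor \log_{1+\varepsilon} \frac{d(q,y)}{1+\varepsilon} \rfloor + 1$ and $\lceil \log_{1+\varepsilon} \frac{d(q,y)}{\varepsilon} \rceil$, and the argument $((1+\varepsilon)^s + (1+\varepsilon)^{-s} - 1)^{1/s}$ of $p'_2$, which never exceeds $1+\varepsilon$. The sketch below is only an illustration with hypothetical function names; it does not evaluate the collision probabilities themselves.

```python
import math

def step2_iterations(dqy, eps):
    """Number of bucket indices visited at step 2 for a given value of d(q, y)."""
    i_lo = math.floor(math.log(dqy / (1 + eps)) / math.log1p(eps)) + 1
    i_hi = math.ceil(math.log(dqy / eps) / math.log1p(eps))
    return i_hi - i_lo + 1

def lifted_far_distance(eps, s):
    """Argument of p'_2: ((1+eps)^s + (1+eps)^(-s) - 1)^(1/s)."""
    return ((1 + eps) ** s + (1 + eps) ** (-s) - 1) ** (1 / s)

for eps in (0.5, 0.1, 0.01):
    print(eps,
          step2_iterations(1.0, eps),
          lifted_far_distance(eps, 1),
          lifted_far_distance(eps, 2))
```

The iteration count depends on $d(q,y)$ only through rounding: it always lies between $\log_{1+\varepsilon}\frac{1+\varepsilon}{\varepsilon}$ and that value plus two.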



Replacing Theorem 3.7 by its specialized versions for $s = 1$ and $s = 2$ in the analysis immediately gives the following running time bounds:

Theorem 5.3 (case s = 1). Given a query point $q \in (\mathbb{R}^d, \ell_1)$, the expected running time of
Algorithm^ is 0(jn e + n a \e(2 + e)-1ZNN p(q)\) , where g < 1+min | £ L 2 < 1 an d ot < eg < e. 

Theorem 5.3 (case s = 2). Given a query point $q \in (\mathbb{R}^d, \ell_2)$, the expected running time of
Algorithmic is 0(jn e + n a \e(2 + s)-1ZNN p(q)\), where g < 1+e y^ 1+e ^ < 1 and a < eg < e. 



As mentioned in Section 5.2, the $\mathcal{RNNDS}(P, \varepsilon)$ data structure consists mainly of a collection of pairwise-disjoint non-empty buckets, of total cardinality $n$, and for each bucket $P_i$ an

~ n 1+e ' 

A' (Pi, (1 + e) l ,e) data structure of size 0( * ) where ni = \Pi\, by Theorem 3.7 This gives a 



total size of OQ^ = 0(^-)- In addition, 1ZNNVS(P,e) stores the tree structure T(P,e 



and its associated (r,e)-V££B data structures, whose total size is 0(^ Zi -^), by Corollary 



e pi 



2.4 



Finally, $\mathcal{RNNDS}(P, \varepsilon)$ stores a vector $P_y$ for each point $y \in P$, which requires a total space of $O(\sum_{y \in P} |P_y|)$, where $|P_y| = 1 + |\varepsilon\text{-}\mathcal{RNN}_P(y)| \le n$. Combining these bounds and using the fact that $\rho' \ge \rho$, we obtain the following bound on the size of the data structure (where $p'_2$ and $\rho'$ have been renamed respectively $p_2$ and $\rho$ for convenience):

Theorem 5.4. The size of the data structure lZMMT)S(P,e) built by Algorithm^ is 

J2 yeP \e-lZMM P (y)\) = 0(1^ + n 2 ), where g = < 1, the quantities pi = $(1) and p 2 = 

$(((1 + e) s + (1 + e)~ s — l) 1 / 5 ) &einp derived from some s-stable distribution D according to Eq. 

5.5 Bichromatic $\mathcal{RNN}$

Let $(X, d)$ be a metric space, and let $B, Y$ be two finite subsets of $X$, respectively referred to as the blue and yellow sets in the following. Given a point $x \in X$, a reverse nearest neighbor of $x$ in this bichromatic setting is a point $b \in B \setminus \{x\}$ such that $x \in \mathcal{NN}_{Y \cup \{x\}}(b)$. Let $\mathcal{RNN}_{B,Y}(x)$ denote the set of all such points. By analogy, given a parameter $\varepsilon > 0$, a reverse $\varepsilon$-nearest neighbor of $x$ is a point $b \in B \setminus \{x\}$ such that $x \in \varepsilon\text{-}\mathcal{NN}_{Y \cup \{x\}}(b)$, and we let $\varepsilon\text{-}\mathcal{RNN}_{B,Y}(x)$ denote the set of all such points. The bichromatic version of Problem 3 is stated as follows:

Problem 7 (Bichromatic $\mathcal{RNN}$). Given a query point $q \in X$, the bichromatic reverse nearest neighbors query asks to retrieve the set $\mathcal{RNN}_{B,Y}(q)$.
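As a point of reference, the bichromatic definition can be evaluated directly by exhaustive search. The following sketch is a plain brute-force transcription of the definition, not the data structure of this section; the function names are hypothetical and the Euclidean distance is used for concreteness.

```python
import math

def dist_to_yellow(b, yellow):
    """d(b, Y): distance from b to its nearest yellow point other than b itself."""
    return min(math.dist(b, y) for y in yellow if y != b)

def bichromatic_rnn(q, blue, yellow):
    """Blue points b != q for which q is at least as close as every yellow point."""
    return {b for b in blue
            if b != q and math.dist(b, q) <= dist_to_yellow(b, yellow)}

B = [(0.0, 0.0), (2.0, 0.0), (9.0, 0.0)]
Y = [(4.0, 0.0), (10.0, 0.0)]
print(bichromatic_rnn((1.0, 0.0), B, Y))
```

In this toy instance the blue point $(9, 0)$ is discarded because the yellow point $(10, 0)$ is closer to it than the query.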

Our strategy for answering reverse nearest neighbors queries extends quite naturally to the bichromatic setting when the ambient space is $\mathbb{R}^d$ equipped with an $\ell_s$-norm, $s \in (0, 2]$. Given two finite subsets $B, Y$ of $\mathbb{R}^d$ and a parameter $\varepsilon > 0$, the data structure and algorithms are the same as in Section 5.2, modulo the following minor changes:



- the buckets $P_i$ now partition the blue point set $B$, and each bucket $P_i$ gathers the points $b \in B$ such that $(1+\varepsilon)^{i-1} \le d(b, Y) < (1+\varepsilon)^i$,

- the tree structure of Section 2.2 is now built on top of the yellow set $Y$, so we can find approximate nearest neighbors among the yellow points efficiently,

- for each point $y \in Y$, we now store the set $\varepsilon\text{-}\mathcal{RNN}_{B,Y}(y)$ in vector $P_y$, to which we add $y$ itself only if the latter coincides with a point of $B$. The points in $P_y$ are then sorted by increasing distances to $Y$.

Input : point clouds $B, Y \subset \mathbb{R}^d$, parameter $\varepsilon > 0$
Output: $\mathcal{RNNDS}(B, Y, \varepsilon)$ data structure

Initialize $P_i := \emptyset$ for $i \in \mathbb{Z}$;
foreach $b \in B$ do
    Compute $d(b, Y)$ exactly and store it;
    Find $i$ s.t. $(1+\varepsilon)^{i-1} \le d(b, Y) < (1+\varepsilon)^i$ and update $P_i := P_i \cup \{b\}$;
end
foreach $P_i \ne \emptyset$ do
    Build an $A'(P_i, (1+\varepsilon)^i, \varepsilon)$ data structure;
end
foreach $y \in Y$ do
    Build the set $\varepsilon\text{-}\mathcal{RNN}_{B,Y}(y) \cup (\{y\} \cap B)$ and store it in an array $P_y$;
    Sort the points $b \in P_y$ by increasing distances $d(b, Y)$;
end
Build the tree structure $T(Y, \varepsilon)$ of Section 2.2 and its associated $(r, \varepsilon)$-$\mathcal{PLEB}$ data structures;

Algorithm 5: Pre-processing phase for bichromatic $\mathcal{RNN}$.



Input : $\mathcal{RNNDS}(B, Y, \varepsilon)$ data structure, query point $q \in \mathbb{R}^d$

Answer an $\varepsilon$-$\mathcal{NN}$ query on input $(Y, q)$, and let $y$ be the output;
for $i = \lfloor \log_{1+\varepsilon} \frac{d(q,y)}{1+\varepsilon} \rfloor + 1$ to $\lceil \log_{1+\varepsilon} \frac{d(q,y)}{\varepsilon} \rceil$ do
    if $P_i \ne \emptyset$ then
        Answer an exhaustive $(1+\varepsilon)^i$-$\mathcal{PLEB}$ query on input $(P_i, q)$, and let $S_i$ be the output;
    end
end
Let $S := \bigcup_i S_i$;
Look up the value $\frac{d(q,y)}{\varepsilon}$ in the sorted array $P_y$ by binary search;
Iterate from the value $\frac{d(q,y)}{\varepsilon}$ to the end of the array $P_y$ and insert the visited points into $S$;
foreach $b \in S$ do
    if $d(b, q) > d(b, Y)$ then
        Remove $b$ from $S$;
    end
end
Return $S$;

Algorithm 6: Online query phase for bichromatic $\mathcal{RNN}$.



The details of the preprocessing and query procedures are given in Algorithms 5 and 6 for completeness. The proof of correctness with high probability and the complexity analysis extend verbatim to the bichromatic setting, modulo the systematic replacement of point set $P$ by either $B$ or $Y$. We thus obtain the following guarantees:

Theorem 5.5. Given a query point $q \in (\mathbb{R}^d, \ell_s)$, Algorithm 6 answers bichromatic $\mathcal{RNN}$ queries
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correctly with high probability in expected 0(^^-(^jj^ + 1) + ^-|e(2 + b)-TZMMb,y{q)\) time using 
+ E„ e yk-Wfl,y(?/)I) = 0C~ + n 2 ) Ipace, where n = max{\B\, \Y\}, g = and 
a = g(l - g^), tfie quantities p = S^), Pi = $(1) and p 2 = *(((1 + e) s + (1 + e)- s - l) 1 /-) 
6eino derived from some s-stable distribution D according to Eq. Q). 

Theorem 5.5 (case s = 1). Given a query point $q \in (\mathbb{R}^d, \ell_1)$, Algorithm 6 answers bichromatic
1ZMM queries correctly with high probability in expected 0(^n e + n a \e(2 + e)-lZMM b,y (<z)|) iime 
using d(^n l+e + X^eY \b-'RMM'b,y (y)\) = 0{\n l+e + n 2 ) space, where n = max{|5|, \Y\}, g < 
t: — — ^rrr < 1 and a < eg < e. 

Theorem 5.5 (case s = 2). Given a query point $q \in (\mathbb{R}^d, \ell_2)$, Algorithm 6 answers bichromatic
TZAfAf queries correctly with high probability in expected 0(^n e + n a \e(2 + e)-1ZNN b,y (q)\) time 
using d(\n l+e + Y^yaY \£-TWNb,y (y)\) = 0(\n l+e + n 2 ) space, where n = max{|5|, \Y\}, g < 
i +£ 2/( 1+£ ) < 1 and a < eg < e. 



6 Conclusion 

We have introduced a novel algorithm for answering (monochromatic or bichromatic) $\mathcal{RNN}$ queries that is both provably correct and efficient in all dimensions. Our approach is based on a reduction of the problem to standard $\varepsilon$-$\mathcal{NN}$ search plus a controlled number of exhaustive $r$-$\mathcal{PLEB}$ queries, for which we propose a speed-up of the original LSH scheme based on a non-isometric lifting of the data. Along the way, we obtain a new method for answering exact $\mathcal{NN}$ queries, whose complexity bounds reflect the gap in difficulty that exists between exact and approximate queries on a given instance.

Note that the non-isometric lifting trick can be used in a more aggressive way by applying liftings with ever more distortion, so as to reduce the exponent $\alpha$ to arbitrarily small positive constants. However, this comes at the price of a steady degradation of the exponent $\rho$, which gets closer and closer to 1. The question is how far up in distortion one can go before the increase of $\rho$ starts compensating for the reduction of $\alpha$. Another question in the same vein is whether $\alpha$ can be made dependent on $n$. For instance, can $\alpha$ be reduced to $\frac{\ln \ln n}{\ln n}$, so that the output-sensitive term in the query time depends on $\ln n$ instead of a positive power of $n$? More generally, how far from the optimal do our complexity bounds stand?

In this paper we only cared about sublinear query time and polynomial space usage. In practice the degree of the polynomial in the space bound matters, and in this respect the almost-cubic bound of Theorem 4.3 for exact $\mathcal{NN}$ search is not quite satisfactory. Moreover, the current preprocessing time may be high, because some proximity sets, such as $\varepsilon\text{-}\mathcal{RNN}_P(y)$ in step ii of the $\mathcal{RNN}$ procedure, are computed exactly. To speed up the process one could compute them approximately, as in previous literature [15]. The outcome of the query would then likely not be exact, but it might still be approximately correct. In other words, solving approximate $\mathcal{NN}$ and $\mathcal{RNN}$ queries might help speed up the preprocessing times and reduce the size of the data structures.